Definition
A subword tokenization algorithm that builds a vocabulary by starting from individual characters and iteratively merging the token pairs that most increase the likelihood of the training data. This likelihood-based scoring determines which subword units enter the vocabulary.
Detailed Explanation
WordPiece starts with a base vocabulary of individual characters and iteratively adds new tokens by merging adjacent token pairs that maximize the likelihood of the training data. Its probability-based scoring, roughly a pair's frequency divided by the product of its parts' frequencies, differs from BPE's purely frequency-based merge rule. The algorithm was developed at Google, originally for voice-search segmentation, and was later used in Google's neural machine translation system and in BERT.
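As an illustration, here is a minimal sketch of that training-time merge selection on a toy word-count corpus. The word counts, helper names, and the "##" continuation-marker convention are assumptions made for the example, not the API of any particular library.

```python
from collections import Counter

# Toy corpus as word -> count; a real trainer would derive this from text.
word_counts = Counter({"low": 5, "lower": 2, "newest": 6, "widest": 3})

# Initial split: single characters, with "##" marking non-initial pieces
# (the convention used by BERT-style WordPiece vocabularies).
def split_word(word):
    return [word[0]] + ["##" + ch for ch in word[1:]]

splits = {w: split_word(w) for w in word_counts}

def best_merge(splits, word_counts):
    token_freq, pair_freq = Counter(), Counter()
    for word, count in word_counts.items():
        pieces = splits[word]
        for piece in pieces:
            token_freq[piece] += count
        for a, b in zip(pieces, pieces[1:]):
            pair_freq[(a, b)] += count

    # WordPiece score: pair frequency divided by the product of the part
    # frequencies (a likelihood-gain criterion), unlike BPE, which simply
    # picks the most frequent pair.
    def score(pair):
        a, b = pair
        return pair_freq[pair] / (token_freq[a] * token_freq[b])

    return max(pair_freq, key=score)

def apply_merge(pair, splits):
    a, b = pair
    merged = a + (b[2:] if b.startswith("##") else b)
    for word, pieces in splits.items():
        out, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == (a, b):
                out.append(merged)
                i += 2
            else:
                out.append(pieces[i])
                i += 1
        splits[word] = out
    return merged

# Run a few merge steps and print the vocabulary entries they add.
for _ in range(5):
    pair = best_merge(splits, word_counts)
    print("added token:", apply_merge(pair, splits))
```

Each iteration adds one merged token to the vocabulary; training stops when the vocabulary reaches its target size.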
Use Cases
BERT models
Machine translation
Text preprocessing
Multilingual models