Byte pair encoding (BPE)
:material-circle-edit-outline: ็บฆ 144 ไธชๅญ
The original version of this algorithem focuses on compression. It replaces the highest-frequency pair of bytes with a new byte that was not contained in the initial dataset. A lookup table of the replacement is required to rebuild the initial dataset. The modified version builds tokens that match varying amount of source text from single characters to whole words.
Example
Suppose the sentence to be
Since the byte pair (each alphabet as a byte) 'aa' occurs the most often, it is replaced by a byte 'Z' (first match).
Then 'ab' with 'Y'
Then 'ZY' with X
Then this sequence can not be further compressed since there are no byte pair appearing more than once.
This algorithm is effective for tokenization because it has low computational overhead and remains consistent and reliable.
References