Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, with PyTorch/CUDA
MIT License
Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.
This adds PyTorch/CUDA training and encoding support to Andrej Karpathy's minbpe. It takes 67.4 seconds on an H100 with SXM5 to train the BasicTokenizer
with a vocab_size of 512 on 308MB of Enron emails. The original code takes 2hrs 15min on an M2 Air with Python 3.11 to do this. That is a 120x speedup.
Install requirements:
$ pip install -r requirements.txt
Download Enron emails and save a 308MB text file to tests/enron.txt
:
$ python get_enron_emails.py
Train a BasicTokenizer
on the large text file:
$ python train.py
The model will be saved in the models
directory.
The pytest library is used for tests. All of them are located in the tests/
directory. First pip install pytest
, then:
$ pytest -v .
encode
method for RegexTokenizer
train
method for RegexTokenizer
torch.unique
MIT