minbpe-pytorch

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, with PyTorch/CUDA

MIT License

Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.
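
As an illustration of what "byte-level" means in practice, here is a minimal pure-Python sketch of a single BPE merge step. The helper names get_stats and merge, the sample string, and the new token id 256 are assumptions chosen for this example; they are not taken from this repository's code.

```python
from collections import Counter

def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Byte-level: a single non-ASCII character becomes several byte tokens.
e_bytes = list("é".encode("utf-8"))  # two bytes: 195 and 169

# One training step: find the most frequent pair, then merge it.
ids = list("aaabdaaabac".encode("utf-8"))
top_pair, top_count = get_stats(ids).most_common(1)[0]
merged = merge(ids, top_pair, 256)  # 256 is the first id above the byte range

print(top_pair, top_count)  # (97, 97) 4
print(merged)               # [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

Training repeats this step until the vocabulary reaches the requested size, so a vocab_size of 512 corresponds to 256 merges on top of the 256 raw byte values.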

This adds PyTorch/CUDA training and encoding support to Andrej Karpathy's minbpe. Training the BasicTokenizer with a vocab_size of 512 on 308MB of Enron emails takes 67.4 seconds on an H100 SXM5 GPU. The original code takes 2hrs 15min on an M2 Air with Python 3.11 to do the same job. That is roughly a 120x speedup (8,100 seconds vs. 67.4 seconds), with the caveat that the two runs use different hardware.
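
To give a sense of why a tensor-based approach can be much faster, here is a hedged sketch of counting adjacent pairs with torch.unique (the function named in the todo list). The function most_frequent_pair and the sample input are hypothetical; this shows the general technique and is not the repository's actual implementation.

```python
import torch

def most_frequent_pair(ids: torch.Tensor):
    """Return the most frequent adjacent pair in a 1-D tensor of ids, with its count."""
    # Stack each id with its right neighbour: shape (n - 1, 2).
    pairs = torch.stack((ids[:-1], ids[1:]), dim=1)
    # Deduplicate rows and count them in one vectorized call.
    unique_pairs, counts = torch.unique(pairs, dim=0, return_counts=True)
    best = torch.argmax(counts)
    return tuple(unique_pairs[best].tolist()), int(counts[best])

# Fall back to the CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
ids = torch.tensor(list("aaabdaaabac".encode("utf-8")), dtype=torch.int64, device=device)

pair, count = most_frequent_pair(ids)
print(pair, count)  # (97, 97) 4
```

The whole pair count is a few tensor operations instead of a Python loop over every byte, which is the kind of work a GPU handles well on a 308MB input.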

quick start

Install requirements:

$ pip install -r requirements.txt

Download Enron emails and save a 308MB text file to tests/enron.txt:

$ python get_enron_emails.py

Train a BasicTokenizer on the large text file:

$ python train.py

The model will be saved in the models directory.

tests

The pytest library is used for tests. All of them are located in the tests/ directory. First pip install pytest, then:

$ pytest -v .

todo

Speed up encode method for RegexTokenizer
Implement train method for RegexTokenizer
Support the MPS device on MacBooks (currently breaks on torch.unique)

License

MIT
