GPT implementation in Flax
A basic transformer implementation, for seq2seq modeling in Flax/JAX. Written for educational purposes 🏫.
Also includes some bells and whistles, such as memory-efficient chunked self-attention [1].
[1] https://arxiv.org/pdf/2112.05682v2.pdf
Install:
pip install -r requirements.txt
Train:
$ python train_char.py --help
usage: train_char.py [-h] --dataset-path PATH [--experiment-name STR] [--restore-checkpoint] [--max-epochs INT]
[--minibatch-size INT] [--block-size INT] [--gpt-config.vocab-size INT] [--gpt-config.block-size INT]
[--gpt-config.n-head INT] [--gpt-config.resid-pdrop FLOAT] [--gpt-config.attn-pdrop FLOAT]
[--gpt-config.chunk-attention] [--gpt-config.q-chunk-size INT] [--gpt-config.kv-chunk-size INT]
[--gpt-config.n-layer INT] [--gpt-config.embd-dim INT] [--gpt-config.embd-pdrop FLOAT]
[--optimizer-config.learning-rate FLOAT] [--optimizer-config.no-lr-decay]
[--optimizer-config.adam-b1 FLOAT] [--optimizer-config.adam-b2 FLOAT]
[--optimizer-config.warmup-tokens INT] [--optimizer-config.final-tokens INT]
[--optimizer-config.weight-decay FLOAT] [--optimizer-config.grad-norm-clip FLOAT]
required arguments:
--dataset-path PATH Path to a text file, to be loaded for training. Needs to fit in memory.
optional arguments:
-h, --help show this help message and exit
--experiment-name STR
(default: char_2022-01-07-18:01:54)
--restore-checkpoint
--max-epochs INT (default: 1000)
--minibatch-size INT (default: 128)
--block-size INT (default: 128)
--gpt-config.vocab-size INT
(default: 256)
--gpt-config.block-size INT
The history/context length of our sequence model. (default: 128)
--gpt-config.n-head INT
Number of heads for multi-headed self-attention. (default: 8)
--gpt-config.resid-pdrop FLOAT
Dropout probability. (default: 0.1)
--gpt-config.attn-pdrop FLOAT
Dropout probability. (default: 0.1)
--gpt-config.chunk-attention
Enable attention chunking to trade runtime for memory efficiency. We implement an
approach similar to the algorithm presented here:
https://arxiv.org/pdf/2112.05682v2.pdf
If chunking is enabled, both q_chunk_size and kv_chunk_size must be set.
Note that `block_size % chunk_size` must be 0 for both chunk sizes.
--gpt-config.q-chunk-size INT
(default: None)
--gpt-config.kv-chunk-size INT
(default: None)
--gpt-config.n-layer INT
(default: 8)
--gpt-config.embd-dim INT
(default: 512)
--gpt-config.embd-pdrop FLOAT
Dropout probability. (default: 0.1)
--optimizer-config.learning-rate FLOAT
(default: 0.0006)
--optimizer-config.no-lr-decay
If decay is enabled, we use cosine annealing.
--optimizer-config.adam-b1 FLOAT
(default: 0.9)
--optimizer-config.adam-b2 FLOAT
(default: 0.95)
--optimizer-config.warmup-tokens INT
Tokens before reaching full learning rate. (default: 10240)
--optimizer-config.final-tokens INT
At what point we reach 10% of the original LR. (default: 2560)
--optimizer-config.weight-decay FLOAT
L2 regularization coefficient. (default: 0.1)
--optimizer-config.grad-norm-clip FLOAT
(default: 1.0)
As an example, to train with self-attention chunk sizes of 64:
$ python train_char.py --dataset-path ./some_text_file --gpt-config.chunk-attention --gpt-config.q-chunk-size 64 --gpt-config.kv-chunk-size 64
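For intuition, here is a minimal sketch of the chunked attention scheme from [1]: queries are processed one chunk at a time, and for each query chunk a streaming softmax is accumulated over key/value chunks, so full `[T, T]` score matrices are never materialized. This is an illustrative, unmasked single-head version with hypothetical names, not the repo's actual API (the real model also applies causal masking and dropout):

```python
import jax
import jax.numpy as jnp


def chunked_attention(q, k, v, q_chunk_size, kv_chunk_size):
    # q, k, v: [T, d]. T must be divisible by both chunk sizes,
    # matching the `block_size % chunk_size == 0` requirement above.
    T, d = q.shape
    scale = d ** -0.5

    def attend_q_chunk(q_chunk):  # q_chunk: [q_chunk_size, d]
        def scan_kv(carry, kv_chunk):
            num, den, m = carry                     # running numerator, denominator, max
            k_c, v_c = kv_chunk                     # [kv_chunk_size, d] each
            s = (q_chunk @ k_c.T) * scale           # [q_chunk_size, kv_chunk_size] scores
            m_new = jnp.maximum(m, s.max(axis=-1))  # updated row-wise running max
            corr = jnp.exp(m - m_new)               # rescale previous accumulators
            p = jnp.exp(s - m_new[:, None])
            num = num * corr[:, None] + p @ v_c
            den = den * corr + p.sum(axis=-1)
            return (num, den, m_new), None

        init = (
            jnp.zeros((q_chunk_size, d)),
            jnp.zeros((q_chunk_size,)),
            jnp.full((q_chunk_size,), -jnp.inf),
        )
        ks = k.reshape(-1, kv_chunk_size, d)
        vs = v.reshape(-1, kv_chunk_size, d)
        (num, den, _), _ = jax.lax.scan(scan_kv, init, (ks, vs))
        return num / den[:, None]                   # normalize at the end

    out = jax.lax.map(attend_q_chunk, q.reshape(-1, q_chunk_size, d))
    return out.reshape(T, d)
```

The key design point is the running max `m`: it keeps the exponentials numerically stable without needing a first pass over all scores, which is what lets memory scale with the chunk sizes rather than with `T`.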
The training script will attempt to use all available GPUs; setting `CUDA_VISIBLE_DEVICES` may be helpful if this is undesired.
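The `--optimizer-config` flags describe a token-based schedule: linear warmup to the full learning rate over `warmup-tokens`, then cosine annealing down to 10% of the original LR at `final-tokens`. A plain-Python sketch of that shape (an assumption based on the flag descriptions, not the repo's code):

```python
import math


def learning_rate(tokens, base_lr, warmup_tokens, final_tokens):
    """Token-based LR schedule: linear warmup, then cosine decay to a 10% floor."""
    if tokens < warmup_tokens:
        # Linear warmup from 0 up to base_lr.
        mult = tokens / max(1, warmup_tokens)
    else:
        # Cosine anneal from 1.0 down to a 0.1 floor, reached at final_tokens.
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0))))
    return base_lr * mult
```

For example, with `base_lr=6e-4`, the schedule returns exactly `6e-4` once warmup completes and `6e-5` (10%) at and beyond `final_tokens`.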
Eval (sampling):
$ python eval_char.py
usage: eval_char.py [-h] --experiment-name STR [--sample-steps INT] [--sample-from-top-k INT]
required arguments:
--experiment-name STR
optional arguments:
-h, --help show this help message and exit
--sample-steps INT (default: 500)
--sample-from-top-k INT
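`--sample-from-top-k` restricts each sampling step to the k most likely tokens. A minimal JAX sketch of that idea (illustrative names, not the repo's actual sampling code):

```python
import jax
import jax.numpy as jnp


def sample_top_k(key, logits, k):
    """Sample a token id after masking everything outside the top-k logits."""
    topk_vals, _ = jax.lax.top_k(logits, k)            # values sorted descending
    cutoff = topk_vals[-1]                             # k-th largest logit
    masked = jnp.where(logits < cutoff, -jnp.inf, logits)
    return jax.random.categorical(key, masked)         # sample from softmax(masked)
```

With `k=1` this degenerates to greedy decoding; larger `k` trades determinism for sample diversity.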
Third-party:
This repo also serves as a testbed for a few "core infrastructure" libraries that I've been working on.