AdamW optimizer for bfloat16 models in pytorch 🔥.
However, using bfloat16 in the torch ecosystem is ... awkward (torch AMP is quite non-transparent, and was initially developed with a focus on fp16, which behaves very differently from bf16).
If you just convert all weights and inputs to bfloat16, you're likely to run into the issue of stale weights: updates are too small to modify the bfloat16 weight (see the Gopher paper, section C2, for a large-scale example).
There are two possible remedies:

- use stochastic rounding when writing the updated weight back to bfloat16, or
- accumulate the rounding error and carry it over into subsequent updates (compensated summation).
As a recent study has shown, both options are fully competitive in quality with float32 training. That's what we implement here in a convenient wrapper.
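To see the stale-weight problem concretely, here is a small illustration (not part of the library): an update below bfloat16's rounding threshold leaves the weight unchanged.

```python
import torch

# bfloat16 keeps only 8 bits of significand precision, so near 1.0 the spacing
# between representable values is about 2**-7 ≈ 0.008. Any update smaller than
# roughly half of that is rounded away entirely.
w = torch.tensor(1.0, dtype=torch.bfloat16)
update = torch.tensor(1e-3, dtype=torch.bfloat16)

print(w + update == w)                                 # tensor(True): the weight never moves
print(torch.tensor(1.0) + torch.tensor(1e-3) == 1.0)   # tensor(False): float32 keeps the update
```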
```bash
pip install git+https://github.com/arogozhnikov/adamw_bfloat16.git
```
Use it as a drop-in replacement for PyTorch's AdamW:
```python
import torch
from adamw_bfloat16 import LR, AdamW_BF16

model = model.to(torch.bfloat16)

# default preheat and decay
optimizer = AdamW_BF16(model.parameters())

# or configure the LR schedule using the built-in scheduling
optimizer = AdamW_BF16(model.parameters(), lr_function=LR(lr=1e-4, preheat_steps=5000, decay_power=-0.25))

# in the training loop:
loss.backward()
optimizer.step()
optimizer.zero_grad()
```
Or you can even replace the last two lines with a single call:
```python
optimizer.step(zero_grad=True)
```
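Putting it together, here is a minimal end-to-end sketch; the toy model, synthetic data, and loss are placeholders for illustration, while the optimizer calls follow the API shown above:

```python
import torch
from torch import nn
from adamw_bfloat16 import LR, AdamW_BF16

# toy model and synthetic data, purely for illustration
model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 10)).to(torch.bfloat16)
optimizer = AdamW_BF16(model.parameters(), lr_function=LR(lr=1e-4, preheat_steps=5000, decay_power=-0.25))

for step in range(10_000):
    x = torch.randn(16, 32, dtype=torch.bfloat16)   # inputs are also bfloat16
    target = torch.randint(0, 10, (16,))

    loss = nn.functional.cross_entropy(model(x), target)
    loss.backward()
    optimizer.step(zero_grad=True)   # update weights and zero gradients in one call
```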
This optimizer also simplifies the training code by removing:

- the separate gradient zeroing after `.step()` (pass `zero_grad=True` instead)
- the external LR scheduler (scheduling is built into the optimizer)

It uses ~25% less memory per parameter compared to built-in AdamW.