LinearAttentionArena

Here we will test various linear attention designs.

pip install pytorch-lightning==1.9.5 torch deepspeed wandb ninja --upgrade

RWKV-6.0b differences (vs RWKV-6.0): GroupNorm is replaced by LayerNorm, and the "gate" in TimeMix is removed, so the parameter count is lower (see the sketch below).
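A minimal sketch of this difference, assuming PyTorch. This is not the repository's actual code; the class, argument, and tensor names are illustrative only, and only the output path of TimeMix is shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeMixOut_v060(nn.Module):
    # RWKV-6.0 style: per-head GroupNorm on the wkv output, plus a SiLU gate.
    def __init__(self, dim, n_head):
        super().__init__()
        self.ln_x = nn.GroupNorm(n_head, dim)        # normalizes each head separately
        self.gate = nn.Linear(dim, dim, bias=False)  # extra "gate" parameters
        self.output = nn.Linear(dim, dim, bias=False)

    def forward(self, xg, wkv):
        B, T, C = wkv.shape
        x = self.ln_x(wkv.view(B * T, C)).view(B, T, C)
        return self.output(x * F.silu(self.gate(xg)))

class TimeMixOut_v060b(nn.Module):
    # RWKV-6.0b style: plain LayerNorm and no gate, so fewer parameters.
    def __init__(self, dim, n_head):
        super().__init__()
        self.ln_x = nn.LayerNorm(dim)
        self.output = nn.Linear(dim, dim, bias=False)

    def forward(self, xg, wkv):
        return self.output(self.ln_x(wkv))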

# Example: RWKV-6.0b L12-D768 (189M params) on 4x4090, minipile 1.5B tokens loss 2.812

./prepare.sh --model_type "x060b" --layer 12 --emb 768 --ctx_len 512 --suffix "-0"

./train.sh --model_type "x060b" --layer 12 --emb 768 --lr_init "6e-4" --lr_final "2e-4" --ctx_len 512 --n_gpu 4 --m_bsz 32 --grad_cp 0 --save_period 1000 --suffix "-0"
# Example: Mamba L12-D768 (191M params) on 4x4090, minipile 1.5B tokens loss 2.885

./prepare.sh --model_type "mamba" --layer 12 --emb 768 --ctx_len 512 --suffix "-0"

./train.sh --model_type "mamba" --layer 12 --emb 768 --lr_init "6e-4" --lr_final "2e-4" --ctx_len 512 --n_gpu 4 --m_bsz 32 --grad_cp 0 --save_period 1000 --suffix "-0"