Implementation of ST-MoE, the latest incarnation of mixture of experts after years of research at Brain, in Pytorch. It will be largely a transcription of the official Mesh Tensorflow implementation. If you have any papers you think should be added, while I have my attention on mixture of experts, please open an issue.

This should be SOTA for mixture-of-experts in autoregressive transformers. It is rumored that GPT-4 uses 16 experts with top-2 gating.

For non-autoregressive models, I would recommend going with the simpler and better Soft MoE.
```bash
$ pip install st-moe-pytorch
```
- StabilityAI for the generous sponsorship, as well as my other sponsors, for affording me the independence to open source artificial intelligence

- Aran Komatsuzaki for consultation on mixture-of-experts, and for the removal of 2-level MoE and simplifications to the code
```python
import torch
from st_moe_pytorch import MoE

moe = MoE(
    dim = 512,
    num_experts = 16,               # increase the experts (# parameters) of your model without increasing computation
    gating_top_n = 2,               # default to top-2 gating, but can also be more (top-3 was tested in the paper with a lower threshold)
    threshold_train = 0.2,          # at what threshold to accept a token to be routed to the second expert and beyond - 0.2 was optimal for 2-expert routing, and apparently should be lower for 3
    threshold_eval = 0.2,
    capacity_factor_train = 1.25,   # experts have a fixed capacity per batch. we need some extra capacity in case gating is not perfectly balanced.
    capacity_factor_eval = 2.,      # capacity_factor_* should be set to a value >= 1
    balance_loss_coef = 1e-2,       # multiplier on the auxiliary expert balancing loss
    router_z_loss_coef = 1e-3,      # loss weight for the router z-loss
)

inputs = torch.randn(4, 1024, 512)
out, total_aux_loss, balance_loss, router_z_loss = moe(inputs) # (4, 1024, 512), (1,), (1,), (1,)
```
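For intuition, the two auxiliary losses follow the definitions from the Switch Transformer and ST-MoE papers. Below is a minimal standalone sketch of those definitions (illustrative only, not the internals of this library): the balance loss multiplies, per expert, the fraction of tokens routed to it by the mean router probability it receives, while the router z-loss penalizes the squared logsumexp of the raw router logits. It also shows the fixed per-expert capacity implied by a capacity factor.

```python
import torch
import torch.nn.functional as F

# standalone sketch of the paper definitions - names and shapes here are illustrative

num_experts = 16
logits = torch.randn(4, 1024, num_experts)  # raw router logits, one per token
probs = logits.softmax(dim = -1)

# balance loss (Switch Transformer): num_experts * sum_e f_e * P_e
# f_e - fraction of tokens whose top-1 choice is expert e
# P_e - mean router probability assigned to expert e
dispatch = F.one_hot(probs.argmax(dim = -1), num_experts).float()
f = dispatch.mean(dim = (0, 1))
P = probs.mean(dim = (0, 1))
balance_loss = num_experts * (f * P).sum()

# router z-loss (ST-MoE): mean squared logsumexp of the logits,
# which discourages large router logits and improves training stability
router_z_loss = torch.logsumexp(logits, dim = -1).pow(2).mean()

# fixed expert capacity, as implied by the capacity factor
seq_len, capacity_factor = 1024, 1.25
expert_capacity = int(seq_len / num_experts * capacity_factor)  # 80 token slots per expert
```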
```python
# for the entire mixture of experts block, in the context of a transformer

from st_moe_pytorch import SparseMoEBlock

moe_block = SparseMoEBlock(
    moe,
    add_ff_before = True,
    add_ff_after = True
)

out, total_aux_loss, balance_loss, router_z_loss = moe_block(inputs) # (4, 1024, 512), (1,), (1,), (1,)

# the total auxiliary loss will need to be summed across all MoE blocks and then added to the main loss
# the other two losses are the unweighted breakdown, for logging purposes
```
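The weighted total is the only term that needs to enter your objective. A hypothetical training step, reusing `moe_block` and `inputs` from above (with `main_loss` as a stand-in for your real task loss; sum the auxiliary losses across blocks if you have more than one):

```python
out, total_aux_loss, balance_loss, router_z_loss = moe_block(inputs)

main_loss = out.pow(2).mean()        # stand-in for your actual task loss

loss = main_loss + total_aux_loss    # total aux loss is already weighted by the two coefficients
loss.backward()

# the unweighted breakdown, useful for logging
print(f'balance: {balance_loss.item():.4f} | router z: {router_z_loss.item():.4f}')
```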
- add the router z-loss proposed in the paper
- add the GEGLU expert with multiplicative gating (see the sketch after this list)
- add an entire sparse MoE block, complete with rmsnorm + residual, as well as the ability to specify a feedforward before or after for stability
- double check the equation for the router z-loss for the inner experts in hierarchical MoE
- redo all the transcribed code from Google with einops, as it is not very clear
- consult some MoE experts in the open source community; question why hierarchical MoE is needed, in light of the results from Soft MoE
- offer a top-n gating generalization, as it seems top-3 (with a smaller threshold) can work even better
- figure out if there was an error in a previous transcription - no, there was not an error
- allow for different thresholds for the second vs third routed expert
- add coordinate descent based routing
- make a first naive non-optimized attempt at distributed code for mixture of experts
- distributed
  - improvise a Top2GatingWithCoordinateDescent for MoE without importance
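As an illustration of the GEGLU expert item above, here is a minimal sketch of a feedforward expert with multiplicative GELU gating; the class name `GEGLUExpert` and the hidden sizing are hypothetical, not necessarily what this repository uses internally.

```python
import torch
from torch import nn
import torch.nn.functional as F

class GEGLUExpert(nn.Module):
    # feedforward expert whose hidden activation is gated multiplicatively:
    # hidden = x * gelu(gate), as in the GEGLU variant of GLU
    def __init__(self, dim, mult = 4):
        super().__init__()
        inner_dim = int(dim * mult * 2 / 3)  # common sizing to keep parameters comparable to a plain FF
        self.proj_in = nn.Linear(dim, inner_dim * 2)
        self.proj_out = nn.Linear(inner_dim, dim)

    def forward(self, x):
        x, gate = self.proj_in(x).chunk(2, dim = -1)
        return self.proj_out(x * F.gelu(gate))

expert = GEGLUExpert(dim = 512)
tokens = torch.randn(4, 1024, 512)
assert expert(tokens).shape == tokens.shape
```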
```bibtex
@inproceedings{Zoph2022STMoEDS,
    title   = {ST-MoE: Designing Stable and Transferable Sparse Expert Models},
    author  = {Barret Zoph and Irwan Bello and Sameer Kumar and Nan Du and Yanping Huang and Jeff Dean and Noam M. Shazeer and William Fedus},
    year    = {2022}
}
```