Implementation of Soft MoE (Mixture of Experts), proposed by Brain's Vision team, in Pytorch.
This MoE has only been made to work with non-autoregressive encoders. However, some recent text-to-image models have started using MoE with great results, so it may be a fit there.
If anyone has any ideas for how to make it work for autoregressive decoding, let me know (through email or discussions). I have meditated on it but cannot think of a good way. The other issue with the slot scheme is that the routing cost becomes quadratic as sequence length increases (much like attention), since the number of slots scales with sequence length.
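To make that last point concrete, here is a rough shape-level sketch (an illustration only, not this repository's code) of the routing logits: every token is scored against every expert slot, and once the number of slots tracks the sequence length, the dispatch / combine weights grow quadratically, just like attention scores.

import torch

# hypothetical shapes, purely to illustrate the routing cost
batch, seq_len, dim = 1, 1024, 512
num_experts = 4
num_slots = seq_len // num_experts   # slots scale with sequence length

tokens = torch.randn(batch, seq_len, dim)
slot_embeds = torch.randn(num_experts, num_slots, dim)

# one routing logit per (token, expert slot) pair
logits = torch.einsum('b n d, e s d -> b n e s', tokens, slot_embeds)

print(logits.shape) # torch.Size([1, 1024, 4, 256]) - seq_len * (seq_len // num_experts) scores, quadratic in seq_len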
StabilityAI for the generous sponsorship, as well as my other sponsors out there
Einops for making my life easy
$ pip install soft-moe-pytorch
import torch
from soft_moe_pytorch import SoftMoE
moe = SoftMoE(
    dim = 512,              # model dimensions
    seq_len = 1024,         # max sequence length (will automatically calculate number of slots as seq_len // num_experts) - you can also set num_slots directly
    num_experts = 4         # number of experts - (they suggest number of experts should be high enough that each of them get only 1 slot. wonder if that is the weakness of the paper?)
)
x = torch.randn(1, 1024, 512)
out = moe(x) + x # (1, 1024, 512) - add in a transformer in place of a feedforward at a certain layer (here showing the residual too)
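As a rough sketch of what "in place of a feedforward" might look like (the pre-norm layout and the attention module below are stand-ins for illustration, not part of this package):

import torch
from torch import nn
from soft_moe_pytorch import SoftMoE

class SoftMoETransformerBlock(nn.Module):
    # hypothetical pre-norm block: attention, then a SoftMoE where the usual feedforward would go
    def __init__(self, dim = 512, seq_len = 1024, num_experts = 4, heads = 8):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first = True)
        self.moe_norm = nn.LayerNorm(dim)
        self.moe = SoftMoE(dim = dim, seq_len = seq_len, num_experts = num_experts)

    def forward(self, x):
        normed = self.attn_norm(x)
        attn_out, _ = self.attn(normed, normed, normed)
        x = attn_out + x                      # residual around attention
        x = self.moe(self.moe_norm(x)) + x    # residual around the soft-moe "feedforward"
        return x

block = SoftMoETransformerBlock()
x = torch.randn(1, 1024, 512)
out = block(x) # (1, 1024, 512)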
For an improvised variant that does dynamic slots, so that the number of slots ~= sequence length, just import DynamicSlotsSoftMoE instead.
import torch
from soft_moe_pytorch import DynamicSlotsSoftMoE
# sequence length or number of slots need not be specified
moe = DynamicSlotsSoftMoE(
    dim = 512,          # model dimensions
    num_experts = 4,    # number of experts
    geglu = True        # use a GEGLU (GLU variant) feedforward within each expert
)
x = torch.randn(1, 1023, 512)
out = moe(x) + x # (1, 1023, 512)
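Because the slots are derived from the input itself, the same module should handle varying sequence lengths - continuing the example above:

# reusing the moe instance from above with different sequence lengths
short = torch.randn(1, 256, 512)
long = torch.randn(1, 2048, 512)

out_short = moe(short) + short # (1, 256, 512)
out_long = moe(long) + long    # (1, 2048, 512)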
@misc{puigcerver2023sparse,
title = {From Sparse to Soft Mixtures of Experts},
author = {Joan Puigcerver and Carlos Riquelme and Basil Mustafa and Neil Houlsby},
year = {2023},
eprint = {2308.00951},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}
@misc{shazeer2020glu,
title = {GLU Variants Improve Transformer},
author = {Noam Shazeer},
year = {2020},
url = {https://arxiv.org/abs/2002.05202}
}