Implementation of Mega, the Single-head Attention with Multi-headed EMA architecture that currently holds SOTA on Long Range Arena
GPT, but made only out of MLPs
Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts
Implementation of MEGABYTE, Predicting Million-byte Sequences with Multiscale Transformers, in Pytorch
Implementation of OmniNet, Omnidirectional Representations from Transformers, in Pytorch
Implementation of the Point Transformer layer, in Pytorch
(Unofficial) Implementation of dilated attention from "LongNet: Scaling Transformers to 1,000,000,000 Tokens"
A simple cross attention that updates both the source and target sequences in one step (see the sketch after this list)
Implementation of Perceiver AR, Deepmind's new long-context attention network based on Perceiver architecture, in Pytorch
Implementation of gMLP, an all-MLP replacement for Transformers, in Pytorch
Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"
Implementation of Agent Attention in Pytorch
MSA Transformer reproduction code
Implementation of CALM from the paper "LLM Augmented LLMs: Expanding Capabilities through Composition"
Implementation of Perceiver, General Perception with Iterative Attention, in Pytorch
Implementation of the Transformer variant proposed in "Transformer Quality in Linear Time"
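For the bidirectional cross attention entry above, here is a minimal PyTorch sketch of the idea: a single source-target similarity matrix is normalized along each axis so that both sequences attend to each other and are updated in one step. The class name BidirectionalCrossAttention, its constructor arguments, and the tensor shapes are illustrative assumptions, not the repository's actual API.

import torch
from torch import nn

class BidirectionalCrossAttention(nn.Module):
    # joint attention over two sequences: one similarity matrix, softmaxed
    # along each axis, lets each sequence attend to the other in a single step
    def __init__(self, dim, dim_head=64, heads=8):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        inner = dim_head * heads
        self.to_qk_a = nn.Linear(dim, inner, bias=False)  # projection for sequence a
        self.to_qk_b = nn.Linear(dim, inner, bias=False)  # projection for sequence b
        self.to_v_a = nn.Linear(dim, inner, bias=False)
        self.to_v_b = nn.Linear(dim, inner, bias=False)
        self.out_a = nn.Linear(inner, dim, bias=False)
        self.out_b = nn.Linear(inner, dim, bias=False)

    def forward(self, a, b):
        h = self.heads
        split = lambda t: t.reshape(t.shape[0], t.shape[1], h, -1).transpose(1, 2)
        qk_a, qk_b = split(self.to_qk_a(a)), split(self.to_qk_b(b))
        v_a, v_b = split(self.to_v_a(a)), split(self.to_v_b(b))

        # shared similarity matrix between the two sequences
        sim = torch.einsum('b h i d, b h j d -> b h i j', qk_a, qk_b) * self.scale

        attn_ab = sim.softmax(dim=-1)  # a attends to b (normalize over j)
        attn_ba = sim.softmax(dim=-2)  # b attends to a (normalize over i)

        out_a = torch.einsum('b h i j, b h j d -> b h i d', attn_ab, v_b)
        out_b = torch.einsum('b h i j, b h i d -> b h j d', attn_ba, v_a)

        merge = lambda t: t.transpose(1, 2).reshape(t.shape[0], -1, h * t.shape[-1])
        return self.out_a(merge(out_a)), self.out_b(merge(out_b))

# usage: both sequences come back updated from a single call
attn = BidirectionalCrossAttention(dim=512)
src, tgt = torch.randn(1, 128, 512), torch.randn(1, 64, 512)
src_out, tgt_out = attn(src, tgt)

The key design point is that the two softmaxes share one einsum result, so the pairwise similarities are computed once rather than separately per direction.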