Exploring an idea where one forgets about efficiency and carries out attention across every edge between the nodes (tokens)
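A minimal sketch of what such exhaustive edge-wise attention could look like, assuming "attention across each edge" means forming an explicit feature for every (i, j) pair of tokens and running ordinary softmax attention over all n² edge tokens; the module and names below (EdgeAttention, to_edge) are illustrative, not taken from the repository.

```python
import torch
from torch import nn

class EdgeAttention(nn.Module):
    # hypothetical sketch: treat every (i, j) pair of tokens as an "edge token"
    # and run ordinary softmax attention over all n^2 edge tokens (no efficiency tricks)
    def __init__(self, dim, heads = 8):
        super().__init__()
        self.to_edge = nn.Linear(dim * 2, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first = True)

    def forward(self, x):                        # x: (batch, n, dim) token embeddings
        b, n, d = x.shape
        # build an explicit feature for every directed edge (i, j)
        src = x.unsqueeze(2).expand(b, n, n, d)
        dst = x.unsqueeze(1).expand(b, n, n, d)
        edges = self.to_edge(torch.cat((src, dst), dim = -1)).reshape(b, n * n, d)
        # full attention over n^2 edge tokens - O(n^4) attention scores
        out, _ = self.attn(edges, edges, edges)
        return out.reshape(b, n, n, d)           # per-edge output representations
```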
A variant of Transformer-XL where the memory is updated not with a queue, but with attention
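As a rough illustration of that update rule (a sketch under assumptions, not the repository's code): the memory slots act as queries over the new segment's hidden states, and the attention output rewrites the memory in place of a FIFO push.

```python
import torch
from torch import nn

class AttentiveMemoryUpdate(nn.Module):
    # hypothetical sketch: instead of FIFO-appending the newest segment's hidden states
    # to the memory (Transformer-XL style), the existing memory slots attend to the new
    # hidden states and are rewritten by the attention output (residual update)
    def __init__(self, dim, heads = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first = True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory, hiddens):
        # memory:  (batch, num_mem, dim) persistent memory slots
        # hiddens: (batch, seq_len, dim) hidden states from the current segment
        updates, _ = self.attn(memory, hiddens, hiddens)  # memory queries, hiddens as keys / values
        return self.norm(memory + updates)
```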
Implementation of the transformer proposed in "Building Blocks for a Complex-Valued Transformer Architecture"
Implementation of the Equiformer, SE3/E3 equivariant attention network that reaches new SOTA, and adopted for use by EquiFold for protein folding
An implementation of Performer, a linear attention-based transformer, in Pytorch
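For context, Performer replaces the softmax attention matrix with a random-feature approximation so that attention cost scales linearly with sequence length. Below is a minimal non-causal sketch in that spirit, using plain (non-orthogonal) Gaussian random features; the function names are illustrative, not the repository's API.

```python
import torch

def softmax_kernel_features(x, projection, eps = 1e-4):
    # positive random features phi(x) = exp(Wx - ||x||^2 / 2) / sqrt(m), whose inner products
    # approximate exp(q . k) in expectation (FAVOR+-style; orthogonal features omitted here)
    m = projection.shape[0]
    proj = torch.einsum('... n d, m d -> ... n m', x, projection)
    return torch.exp(proj - x.pow(2).sum(dim = -1, keepdim = True) / 2) / m ** 0.5 + eps

def performer_attention(q, k, v, num_features = 64):
    # minimal non-causal linear attention: phi(Q) (phi(K)^T V) instead of softmax(QK^T) V
    d = q.shape[-1]
    projection = torch.randn(num_features, d, device = q.device, dtype = q.dtype)
    q, k = (softmax_kernel_features(t * d ** -0.25, projection) for t in (q, k))
    kv = torch.einsum('b h n m, b h n e -> b h m e', k, v)
    z = torch.einsum('b h n m, b h m -> b h n', q, k.sum(dim = -2))   # normalizer
    out = torch.einsum('b h n m, b h m e -> b h n e', q, kv)
    return out / z.unsqueeze(-1)
```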
Implementation of Agent Attention in Pytorch
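The core idea of agent attention, roughly sketched below under the assumption that agent tokens are formed by pooling the queries: a small set of agents attends to the full keys/values, and the queries then attend only to the agents, so no full n × n attention matrix is ever materialized.

```python
import torch

def agent_attention(q, k, v, num_agents = 16):
    # minimal sketch: agent tokens (pooled queries) attend to keys / values, then the
    # original queries attend to the agents - cost is linear in sequence length
    b, h, n, d = q.shape
    assert n % num_agents == 0, 'sequence length must be divisible by num_agents here'
    agents = q.reshape(b, h, num_agents, n // num_agents, d).mean(dim = -2)
    agent_scores = torch.einsum('b h a d, b h n d -> b h a n', agents, k) / d ** 0.5
    agent_out = torch.einsum('b h a n, b h n d -> b h a d', agent_scores.softmax(dim = -1), v)
    query_scores = torch.einsum('b h n d, b h a d -> b h n a', q, agents) / d ** 0.5
    return torch.einsum('b h n a, b h a d -> b h n d', query_scores.softmax(dim = -1), agent_out)
```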
Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts
Implementation of an Attention layer where each head can attend to more than just one token, using coordinate descent to pick the top-k
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
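Ring Attention shards the sequence across devices and circulates key/value blocks around a ring while each device accumulates exact attention blockwise. The single-process sketch below shows only the numerically stable blockwise accumulation (the ring communication is omitted); it is illustrative, not the repository's implementation.

```python
import torch

def blockwise_attention(q, k, v, num_chunks = 4):
    # accumulate exact attention over key / value blocks with a running (online) softmax,
    # the same math each device performs as blocks arrive around the ring
    b, h, n, d = q.shape
    out = torch.zeros_like(q)
    row_max = torch.full((b, h, n, 1), float('-inf'), device = q.device, dtype = q.dtype)
    row_sum = torch.zeros((b, h, n, 1), device = q.device, dtype = q.dtype)
    for k_blk, v_blk in zip(k.chunk(num_chunks, dim = -2), v.chunk(num_chunks, dim = -2)):
        scores = torch.einsum('b h i d, b h j d -> b h i j', q, k_blk) / d ** 0.5
        new_max = torch.maximum(row_max, scores.amax(dim = -1, keepdim = True))
        scale = torch.exp(row_max - new_max)              # rescale previous accumulators
        exp_scores = torch.exp(scores - new_max)
        out = out * scale + torch.einsum('b h i j, b h j d -> b h i d', exp_scores, v_blk)
        row_sum = row_sum * scale + exp_scores.sum(dim = -1, keepdim = True)
        row_max = new_max
    return out / row_sum
```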
An implementation of local windowed attention for language modeling
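A minimal sketch of the non-overlapping windowed case: each query attends only to the keys/values inside its own window. A full implementation would also need causal masking, look-back across window boundaries, and padding; the function below is illustrative only.

```python
import torch

def local_windowed_attention(q, k, v, window_size = 64):
    # each query attends only to keys / values within its own (non-overlapping) window
    b, h, n, d = q.shape
    assert n % window_size == 0, 'sequence length must be divisible by the window size here'
    # fold the sequence into (num_windows, window_size) blocks per head
    q, k, v = (t.reshape(b, h, n // window_size, window_size, d) for t in (q, k, v))
    scores = torch.einsum('b h w i d, b h w j d -> b h w i j', q, k) / d ** 0.5
    attn = scores.softmax(dim = -1)
    out = torch.einsum('b h w i j, b h w j d -> b h w i d', attn, v)
    return out.reshape(b, h, n, d)
```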
Explorations into the recently proposed Taylor Series Linear Attention
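To make the idea concrete, here is a minimal non-causal sketch of second-order Taylor series linear attention: exp(q·k) is approximated by 1 + q·k + (q·k)²/2, realized as an explicit feature map so attention can be computed in linear time. Function names are illustrative, and the usual scaling of queries/keys is omitted.

```python
import torch

def taylor_feature_map(x):
    # explicit feature map phi such that phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2,
    # the 2nd-order Taylor expansion of exp(q.k)
    ones = torch.ones(*x.shape[:-1], 1, device = x.device, dtype = x.dtype)
    second = torch.einsum('... i, ... j -> ... i j', x, x).flatten(-2) / 2 ** 0.5
    return torch.cat((ones, x, second), dim = -1)

def taylor_linear_attention(q, k, v, eps = 1e-6):
    # non-causal linear attention with the Taylor feature map - linear in sequence length
    q, k = taylor_feature_map(q), taylor_feature_map(k)
    kv = torch.einsum('b h n d, b h n e -> b h d e', k, v)            # sum_j phi(k_j) v_j^T
    z = torch.einsum('b h n d, b h d -> b h n', q, k.sum(dim = -2))   # softmax-style normalizer
    out = torch.einsum('b h n d, b h d e -> b h n e', q, kv)
    return out / (z.unsqueeze(-1) + eps)
```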
Implementation of NÜWA, state-of-the-art attention network for text-to-video synthesis, in Pytorch
Implementation of E(n)-Transformer, which incorporates attention mechanisms into Welling's E(n)-Equivariant Graph Neural Network
Implementation of Q-Transformer, Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
Experiments around a simple idea for inducing multiple hierarchical predictive models within a GPT
Implementation of H-Transformer-1D, Hierarchical Attention for Sequence Learning