Implementation of an Attention layer where each head can attend to more than just one token, using coordinate descent to pick the top-k tokens
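
To make the idea concrete, here is a minimal single-head sketch of restricting each query to its top-k keys. The repository relaxes the top-k selection with coordinate descent so the choice stays differentiable; the hard `torch.topk` mask below is a stand-in for that step, and `topk_attention` is an illustrative name, not the repo's API.

```python
# Minimal sketch of single-head attention restricted to the top-k keys per query.
# A hard top-k mask stands in for the repo's differentiable coordinate-descent relaxation.
import torch

def topk_attention(q, k, v, topk = 8):
    # q, k, v: (batch, seq_len, dim)
    scale = q.shape[-1] ** -0.5
    sim = torch.einsum('b i d, b j d -> b i j', q, k) * scale

    # keep only the top-k scores per query, mask out the rest
    topk_vals, _ = sim.topk(topk, dim = -1)
    threshold = topk_vals[..., -1:]                      # k-th largest score per query
    sim = sim.masked_fill(sim < threshold, float('-inf'))

    attn = sim.softmax(dim = -1)
    return torch.einsum('b i j, b j d -> b i d', attn, v)

q = k = v = torch.randn(2, 128, 64)
out = topk_attention(q, k, v, topk = 16)   # (2, 128, 64)
```
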
Implementation of Agent Attention in Pytorch
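
A rough single-head sketch of the agent attention idea, assuming the agent tokens are obtained by average-pooling the queries: a small set of agents first aggregates the keys/values, then the queries read from the agents, giving cost linear in sequence length. The actual implementation is multi-head and adds further details; `agent_attention` is just an illustrative name.

```python
# Sketch of agent attention: softmax(Q Aᵀ) · softmax(A Kᵀ) · V with a few agent tokens A.
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, num_agents = 16):
    # q, k, v: (batch, seq_len, dim)
    b, n, d = q.shape
    scale = d ** -0.5

    # derive agent tokens by pooling the queries down to num_agents positions
    agents = F.adaptive_avg_pool1d(q.transpose(1, 2), num_agents).transpose(1, 2)

    # agents aggregate the sequence (agents act as queries over keys/values)
    agent_attn = (agents @ k.transpose(1, 2) * scale).softmax(dim = -1)   # (b, a, n)
    agent_out = agent_attn @ v                                            # (b, a, d)

    # queries then read from the agents
    query_attn = (q @ agents.transpose(1, 2) * scale).softmax(dim = -1)   # (b, n, a)
    return query_attn @ agent_out                                         # (b, n, d)

out = agent_attention(torch.randn(1, 256, 64), torch.randn(1, 256, 64), torch.randn(1, 256, 64))
```
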
Exploring an idea where one forgets about efficiency and carries out attention across each edge o...
Implementation of Deformable Attention in Pytorch from the paper "Vision Transformer with Deformable Attention"
Implementation of Make-A-Video, new SOTA text to video generator from Meta AI, in Pytorch
An implementation of local windowed attention for language modeling
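
A minimal sketch of the simplest form of local windowed attention, assuming non-overlapping windows and a sequence length divisible by the window size; the actual repository also handles causal masking, look-back into neighboring windows, and relative positions.

```python
# Sketch of non-overlapping local windowed attention: chunk the sequence into
# fixed-size windows and run full attention independently inside each window.
import torch

def windowed_attention(q, k, v, window_size = 64):
    # q, k, v: (batch, seq_len, dim); seq_len assumed divisible by window_size
    b, n, d = q.shape
    w = window_size
    q, k, v = (t.reshape(b, n // w, w, d) for t in (q, k, v))

    sim = torch.einsum('b x i d, b x j d -> b x i j', q, k) * d ** -0.5
    attn = sim.softmax(dim = -1)
    out = torch.einsum('b x i j, b x j d -> b x i d', attn, v)
    return out.reshape(b, n, d)

out = windowed_attention(torch.randn(2, 1024, 64), torch.randn(2, 1024, 64), torch.randn(2, 1024, 64))
```
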
Implementation of the conditionally routed attention in the CoLT5 architecture, in Pytorch
Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts
Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"
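
A sketch of the chunked-softmax trick that paper describes: keys and values are processed chunk by chunk, and partial results are merged with a running max and running sum, so the full n×n attention matrix is never materialized. Function and variable names below are illustrative, not the repository's API.

```python
# Chunked (memory-efficient) attention: accumulate softmax numerator/denominator
# across key/value chunks, rescaling by a running max for numerical stability.
import torch

def chunked_attention(q, k, v, chunk_size = 256):
    # q, k, v: (batch, seq_len, dim)
    scale = q.shape[-1] ** -0.5
    num = torch.zeros_like(q)                                   # running numerator
    den = q.new_zeros(*q.shape[:-1], 1)                         # running denominator
    running_max = q.new_full((*q.shape[:-1], 1), float('-inf'))

    for k_chunk, v_chunk in zip(k.split(chunk_size, dim = 1), v.split(chunk_size, dim = 1)):
        sim = torch.einsum('b i d, b j d -> b i j', q, k_chunk) * scale
        chunk_max = sim.amax(dim = -1, keepdim = True)
        new_max = torch.maximum(running_max, chunk_max)

        # rescale previous accumulators to the new max, then fold in this chunk
        correction = (running_max - new_max).exp()
        exp_sim = (sim - new_max).exp()
        num = num * correction + exp_sim @ v_chunk
        den = den * correction + exp_sim.sum(dim = -1, keepdim = True)
        running_max = new_max

    return num / den

out = chunked_attention(torch.randn(1, 2048, 64), torch.randn(1, 2048, 64), torch.randn(1, 2048, 64))
```
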
Implementation of the Equiformer, SE3/E3 equivariant attention network that reaches new SOTA, and...
A variant of Transformer-XL where the memory is updated not with a queue, but with attention
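
A toy sketch of that update rule, under the assumption that the memory tokens themselves query the new hidden states and are updated residually from what they read; all names below are made up for illustration and are not the repository's API.

```python
# Memory tokens are refreshed by attending over the latest hidden states,
# instead of being replaced queue-style as in Transformer-XL.
import torch
import torch.nn as nn

class AttentiveMemoryUpdate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attend = nn.MultiheadAttention(dim, num_heads = 8, batch_first = True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory, hiddens):
        # memory: (batch, mem_len, dim), hiddens: (batch, seq_len, dim)
        update, _ = self.attend(memory, hiddens, hiddens)   # memories query the new states
        return self.norm(memory + update)                   # residual update of the memory

update_memory = AttentiveMemoryUpdate(dim = 512)
memory = torch.randn(2, 32, 512)
hiddens = torch.randn(2, 128, 512)
new_memory = update_memory(memory, hiddens)                 # (2, 32, 512)
```
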
Implementation of Q-Transformer, Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
Implementation of H-Transformer-1D, Hierarchical Attention for Sequence Learning
An implementation of Performer, a linear attention-based transformer, in Pytorch
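
A non-causal linear attention sketch in the spirit of Performer's FAVOR+ mechanism: queries and keys pass through a positive random-feature map approximating the softmax kernel, after which attention costs O(n) thanks to the associativity of matrix multiplication. The real repository adds causal variants, orthogonal random features, and periodic feature redrawing; the functions below are illustrative only.

```python
# Linear attention with positive random features: softmax(QKᵀ)V is approximated
# by φ(Q) (φ(K)ᵀ V), which never forms the (n x n) attention matrix.
import torch

def random_feature_map(x, projection):
    # positive features exp(x·w - |x|^2 / 2); projection: (dim, num_features)
    x = x * x.shape[-1] ** -0.25
    proj = x @ projection
    return torch.exp(proj - x.pow(2).sum(dim = -1, keepdim = True) / 2) / projection.shape[-1] ** 0.5

def linear_attention(q, k, v, num_features = 256):
    projection = torch.randn(q.shape[-1], num_features, device = q.device)
    q, k = random_feature_map(q, projection), random_feature_map(k, projection)

    kv = torch.einsum('b n f, b n d -> b f d', k, v)                      # aggregate keys/values once
    z = 1 / (torch.einsum('b n f, b f -> b n', q, k.sum(dim = 1)) + 1e-6) # normalizer per query
    return torch.einsum('b n f, b f d, b n -> b n d', q, kv, z)

out = linear_attention(torch.randn(1, 4096, 64), torch.randn(1, 4096, 64), torch.randn(1, 4096, 64))
```
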
Implementation of Block Recurrent Transformer - Pytorch