Quartic Transformer (wip)

Exploring an idea where one forgets about efficiency and carries out attention across every edge between the nodes (tokens). You can think of it as doing attention on the attention matrix itself, viewing that matrix as the set of all directed edges of a fully connected graph.
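
To make the scaling concrete, here is a minimal sketch of attention over edges, under my own assumptions rather than this package's actual internals: the n × n attention matrix is treated as n² edge tokens, and plain softmax attention is run across them, which is where the O(n⁴) cost comes from. The function and weight names are hypothetical.

import torch

def edge_attention(edges, w_q, w_k, w_v):
    # edges: (batch, n, n, dim_edges) - one feature vector per directed edge
    b, n, _, d = edges.shape
    e = edges.reshape(b, n * n, d)                # flatten to n^2 edge "tokens"
    q, k, v = e @ w_q, e @ w_k, e @ w_v           # project edges to queries / keys / values
    sim = (q @ k.transpose(-1, -2)) * d ** -0.5   # (b, n^2, n^2) - the quartic term
    return (sim.softmax(dim = -1) @ v).reshape(b, n, n, d)

# 8 tokens -> 64 directed edges -> a 64 x 64 edge-to-edge attention matrix
dim_edges = 32
w_q, w_k, w_v = (torch.randn(dim_edges, dim_edges) for _ in range(3))
out = edge_attention(torch.randn(1, 8, 8, dim_edges), w_q, w_k, w_v)  # (1, 8, 8, 32)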

The hypothesis is that there is a task out there that the (sub)quartic transformer can do that quadratic transformers cannot.

This repository will also contain a modified implementation of the multi-stream transformer (which is not quartic; its cost is the number of streams times quadratic).
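
For contrast, a rough sketch of the multi-stream idea under my own simplifying assumptions: several independent attention streams run in parallel over the same tokens, so the cost is num_streams × O(n²) rather than O(n⁴). The class name is hypothetical, and Burtsev & Rumshisky's actual architecture carries separate streams through multiple layers before merging.

import torch
from torch import nn

class MultiStreamAttention(nn.Module):
    def __init__(self, dim, num_streams):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads = 4, batch_first = True)
            for _ in range(num_streams)
        ])

    def forward(self, x):
        # each stream attends independently; outputs are averaged at the end
        outs = [attn(x, x, x)[0] for attn in self.streams]
        return torch.stack(outs).mean(dim = 0)

attn = MultiStreamAttention(dim = 512, num_streams = 4)
out = attn(torch.randn(1, 128, 512))  # (1, 128, 512) at 4x quadratic cost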

Appreciation

  • A16Z Open Source AI Grant Program and 🤗 Huggingface for the generous sponsorships, as well as my other sponsors, for affording me the independence to open source current artificial intelligence research

Install

$ pip install quartic-transformer

Usage

import torch
from quartic_transformer import QuarticTransformer

model = QuarticTransformer(
    num_tokens = 256,   # vocabulary size
    depth = 2,          # number of layers
    dim = 512,          # node (token) dimension
    dim_edges = 32      # dimension of each edge feature
)

tokens = torch.randint(0, 256, (1, 128))  # (batch, seq) of token ids

logits = model(tokens) # (1, 128, 256) - (batch, seq, num_tokens)

Todo

  • first add a weak Taylor linear attention on top of all edges (a rough sketch follows this list)

  • use coordinate descent routing from the node attention matrix to select a subset of edges to update (and do full attention across)

  • build multi-stream transformer, but allow exchange of information at the attention matrix, either through residual attention or a small edge-wise feedforward
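
For the first item above, here is a hedged sketch of what a second-order Taylor linear attention over edges could look like: exp(q · k) is approximated by 1 + q · k + (q · k)² / 2 via the feature map φ(x) = [1, x, (x ⊗ x) / √2], so attention is computed in time linear in the number of edges. All names are illustrative, not the planned implementation.

import torch

def taylor_features(x):
    # x: (..., n, d) -> (..., n, 1 + d + d^2), so that
    # phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2
    ones = torch.ones(*x.shape[:-1], 1)
    xx = torch.einsum('... i, ... j -> ... i j', x, x) * (0.5 ** 0.5)
    return torch.cat((ones, x, xx.flatten(start_dim = -2)), dim = -1)

def taylor_linear_attention(q, k, v):
    q, k = taylor_features(q), taylor_features(k)
    kv = torch.einsum('... n d, ... n e -> ... d e', k, v)     # accumulate key-value outer products
    num = torch.einsum('... n d, ... d e -> ... n e', q, kv)   # numerator of the attention output
    den = torch.einsum('... n d, ... d -> ... n', q, k.sum(dim = -2))  # normalizer (always positive)
    return num / den.unsqueeze(-1).clamp(min = 1e-6)

q = k = v = torch.randn(1, 64, 8)  # e.g. 64 edges with a small head dimension
out = taylor_linear_attention(q, k, v)  # (1, 64, 8), linear in the number of edges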

Citation

@inproceedings{Keles2022OnTC,
    title   = {On The Computational Complexity of Self-Attention},
    author  = {Feyza Duman Keles and Pruthuvi Maheshakya Wijewardena and Chinmay Hegde},
    booktitle = {International Conference on Algorithmic Learning Theory},
    year    = {2022},
    url     = {https://api.semanticscholar.org/CorpusID:252198880}
}
@article{Burtsev2021MultiStreamT,
    title   = {Multi-Stream Transformers},
    author  = {Mikhail S. Burtsev and Anna Rumshisky},
    journal = {ArXiv},
    year    = {2021},
    volume  = {abs/2107.10342},
    url     = {https://api.semanticscholar.org/CorpusID:236171087}
}
@misc{Sutton,
    title  = {The Bitter Lesson},
    url    = {http://www.incompleteideas.net/IncIdeas/BitterLesson.html},
    author = {Sutton, Rich}
}
@article{Shazeer2020TalkingHeadsA,
    title   = {Talking-Heads Attention},
    author  = {Noam M. Shazeer and Zhenzhong Lan and Youlong Cheng and Nan Ding and Le Hou},
    journal = {ArXiv},
    year    = {2020},
    volume  = {abs/2003.02436},
    url     = {https://api.semanticscholar.org/CorpusID:212414717}
}