Coordinate Descent Attention (wip)

Implementation of an Attention layer where each head can attend to more than just one token, using coordinate descent to pick topk. Perhaps the number of tokens to attend to can even be learned.

In the case the experiments above fail, the repository will be used for a few other ideas, among them getting coordinate descent routing working for autoregressive transformers.
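To make the main idea concrete, here is a minimal sketch of what replacing softmax with coordinate descent over the qk similarities could look like. This is not the repository's exact code; the function name coor_descent_topk, the iteration count, and the temperature eps are illustrative, following the entropy-regularized top-k formulation of the Schmitzer citation below.

import math

import torch
import torch.nn.functional as F

def coor_descent_topk(sim, k, n_iters = 20, eps = 0.1):
    # sim: attention logits of shape (..., seq_len)
    # alternately update the two dual variables (a, b) of an entropy
    # regularized top-k constraint, then read off relaxed attention scores
    logk = math.log(k)
    a = 0.
    b = -sim

    for _ in range(n_iters):
        a = eps * (logk - ((sim + b) / eps).logsumexp(dim = -1, keepdim = True))
        b = -F.relu(sim + a)

    # each score lies roughly in [0, 1], and the scores sum to roughly k per query
    return ((sim + a + b) / eps).exp()

# hedged usage: in place of sim.softmax(dim = -1) on the qk similarity matrix
sim = torch.randn(1, 8, 1024, 1024)   # (batch, heads, query length, key length)
attn = coor_descent_topk(sim, k = 8)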

Ongoing experiments

Update: I don't think the improvements are worth it. Memory usage also becomes impractical as the number of iterations goes up. I'll keep playing around with topk attention though, because it bothers me that softmax becomes a bottleneck for tokens far in the future, especially as sequence lengths go above 8k

Update: With a kernel written in Triton, it is a bit more viable, but memory is still too high if the number of iterations is large

Update: By recomputing in segments of iterations, it is now feasible, should it actually yield any improvements
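A hedged sketch of what segmented recomputation could look like, using torch.utils.checkpoint rather than the Triton kernel: only the dual variables at segment boundaries are kept, and the iterations inside each segment are recomputed during the backward pass. The segment count and helper names are illustrative, reusing the coor_descent_topk formulation sketched above.

import math

import torch
from torch.utils.checkpoint import checkpoint

def coor_descent_segment(sim, a, b, k, n_iters, eps):
    # run one segment of coordinate descent iterations on the dual variables
    logk = math.log(k)
    for _ in range(n_iters):
        a = eps * (logk - ((sim + b) / eps).logsumexp(dim = -1, keepdim = True))
        b = -torch.relu(sim + a)
    return a, b

def checkpointed_coor_descent(sim, k, n_iters = 40, segments = 4, eps = 0.1):
    # only (a, b) at segment boundaries stay in memory; the intermediate
    # iterations are recomputed when gradients are needed
    a = torch.zeros_like(sim[..., :1])
    b = -sim

    for _ in range(segments):
        a, b = checkpoint(coor_descent_segment, sim, a, b, k, n_iters // segments, eps, use_reentrant = False)

    return ((sim + a + b) / eps).exp()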

Appreciation

  • StabilityAI for the sponsorship to carry out independent research

Install

$ pip install coordinate-descent-attention

Usage

import torch
from coordinate_descent_attention import Transformer

model = Transformer(
    num_tokens = 256,
    dim = 512,
    depth = 2,
    seq_len = 2048,
    dim_head = 64,
    heads = 8,
    attn_use_coor_descent = True   # set to True to switch from softmax to coordinate descent on qk similarity matrix
).cuda()

x = torch.randint(0, 256, (1, 2048)).cuda()

logits = model(x)

Todo

  • let the network control sparsity k
  • try coordinate descent with a few set sparsity levels for the hidden layer of the feedforward (a rough sketch follows this list)
  • ablate against plain topk attention, to make sure any improvement is not simply due to hard attention
  • try using coordinate descent routing on low rank attention heads, route from high rank
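For the feedforward item above, a rough illustration of what a fixed sparsity level on the hidden layer could look like, reusing the coor_descent_topk sketch from earlier; the class name and default hyperparameters are hypothetical.

from torch import nn

class TopkSparseFeedForward(nn.Module):
    # hypothetical sketch: gate the feedforward hidden layer with coordinate
    # descent so that roughly k hidden units remain active per token
    def __init__(self, dim, mult = 4, k = 64, n_iters = 20, eps = 0.1):
        super().__init__()
        self.k, self.n_iters, self.eps = k, n_iters, eps
        self.proj_in = nn.Linear(dim, dim * mult)
        self.proj_out = nn.Linear(dim * mult, dim)

    def forward(self, x):
        hidden = self.proj_in(x)
        # gates in [0, 1], computed with the coor_descent_topk sketch above
        gates = coor_descent_topk(hidden, k = self.k, n_iters = self.n_iters, eps = self.eps)
        return self.proj_out(hidden * gates)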

Citations

@article{Wright2015CoordinateDA,
    title   = {Coordinate descent algorithms},
    author  = {Stephen J. Wright},
    journal = {Mathematical Programming},
    year    = {2015},
    volume  = {151},
    pages   = {3-34}
}
@inproceedings{Gupta2021MemoryefficientTV,
    title   = {Memory-efficient Transformers via Top-k Attention},
    author  = {Ankit Gupta and Guy Dar and Shaya Goodman and David Ciprut and Jonathan Berant},
    booktitle = {SUSTAINLP},
    year    = {2021}
}
@article{Zhao2019ExplicitST,
    title   = {Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection},
    author  = {Guangxiang Zhao and Junyang Lin and Zhiyuan Zhang and Xuancheng Ren and Qi Su and Xu Sun},
    journal = {ArXiv},
    year    = {2019},
    volume  = {abs/1912.11637}
}
@article{Schmitzer2016StabilizedSS,
    title   = {Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems},
    author  = {Bernhard Schmitzer},
    journal = {ArXiv},
    year    = {2016},
    volume  = {abs/1610.06519}
}