Open Source Ecosystems

The repository contains a modular Python implementation of transformer architectures for natural language understanding and generation tasks, according to:

The seminal paper Attention Is All You Need by Vaswani et al.[1] that details the novel attention-based transformer architecture and its application to sequence-to-sequence tasks, demonstrating its effectiveness by achieving state-of-the-art performance in machine translation, surpassing previous LSTM and CNN based neural machine translation architectures.
The chapter on Transformers and Large Language Models from Speech and Language Processing by Jurafsky & Martin[2] which provides a more comprehensive and illustrative look into some of the high-level details discussed in Attention Is All You Need.

Features

Generic encoder-only, decoder-only and encoder-decoder transformer architectures.
Wrappers for causal language modelling, sequence-to-sequence generation and classification/regression.
Various decoding methods for causal/sequence-to-sequence generation:
- Search-based (greedy and beam search)
- Sampling-based (nucleus, temperature and top-k sampling)
Example applications to real-world datasets.

PyTorch restrictions

This project is implemented using PyTorch and PyTorch Lightning.

As PyTorch provides a number of transformer and attention related layers in its torch.nn submodule, this project explicitly avoids the use of:

All other layers provided by torch.nn are allowed, including:

nn.Embedding: For token embedding look-up by vocabulary ID.
nn.LayerNorm: For layer normalization as implemented in Attention Is All You Need.

Other restrictions

Transformer models implemented and made available in other libraries such as HuggingFace's transformers are not used in this project.
However, the tokenizers provided by transformers were used, as developing tokenization algorithms was not the primary objective of this project.
No existing "x from scratch" resources were used, such as the famous Let's build GPT: from scratch, in code, spelled out. by Andrej Karpathy[3].
No other online resources were used, apart from official documentation for packages such as PyTorch, PyTorch Lightning and Huggingface Tokenizers.

Example

Training a causal language model to generate "Florida man"-style news headlines.

from transformers import LlamaTokenizer

from transformer.params import TransformerParams, TemperatureSamplingParams
from transformer.models import CausalLM
from transformer.decoding import TemperatureSamplingDecoder

# initialize HuggingFace tokenizer
tokenizer = LlamaTokenizer.from_pretrained(
    "huggyllama/llama-7b", add_eos_token=True, legacy=False
)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# initialize the causal language model
model = CausalLM(
    params=TransformerParams(context_length=64),
    tokenizer=tokenizer,
)

# train the language model
model.train(...)

# initialize decoder for sequence generation
decoder = TemperatureSamplingDecoder(
    params=TemperatureSamplingParams(max_length=100, temperature=0.5, k=5),
    model=model,
)

# generation without context
decoder.generate()
'Florida man arrested after baby alligator, guns, drugs found inside truck'

# generation with context
decoder.generate("Florida man shot")
'Florida man shot and killed while attempting to steal pizza and Pokemon cards from Target'

Details

While the original architecture described in Attention Is All You Need is an encoder-decoder based architecture using transformers for neural machine translation which is a sequence-to-sequence learning task, this project was designed to be more general, allowing for a variety of natural language tasks by implementing encoder-only, decoder-only and encoder-decoder architectures.

Datasets

The following datasets were used to test the above transformer implementations on various tasks.

arXiv Paper Abstracts: arXiv manuscripts and their metadata including titles, abstracts and categories.
CommonLit Readability Prize: Literary passages and their associated "readability" score for use in grade 3-12 classrooms.
Reddit r/FloridaMan: News headlines about various (often funny and irrational) actions performed by Florida men and women.
Europarl: Transcriptions of European Parliament proceedings between 1996-2006, collected in 11 languages.

Models and notebooks

Encoder-only models

ClassifierLM: A generic transformer-based language model for assigning classes to text.
- notebooks/arxiv_categorization.ipynb applies this model to the arXiv Paper Abstracts dataset to categorize arXiv manuscripts based on their titles.
RegressorLM: A generic transformer-based language model for assigning scores to text.
- notebooks/commonlit_readability.ipynb applies this model to the CommonLit Readability Prize dataset to rate the complexity of literary passages for grade 3-12 students.

Decoder-only models

CausalLM: A generic transformer-based language model for generating text in an autoregressive manner.
- notebooks/florida_man_generation.ipynb applies this model to the Reddit r/FloridaMan dataset to generate humorous news headlines involving the (mis)adventures of Florida men and women.

Encoder-decoder models

Seq2SeqLM: A generic transformer-based language model for generating output text given an input text.
- notebooks/arxiv_summarization.ipynb applies this model to the arxiv Paper Abstracts dataset to generate arXiv paper titles by summarizing their corresponding abstracts.
- notebooks/europarl_translation.ipynb applies this model to the Europarl dataset to translate transcribed parliamentiary proceedings from French to English.

Repository structure

notebooks/: Notebooks applying the models in transformer.models to various datasets.
transformer/: Core package containing the transformer implementations.
- dataloaders/: LightningDataModules for each model in transformer.models.
- decoding/: Decoding method implementations for causal and sequence-to-sequence LMs.
- models/: Task-specific transformers implemented using transformer.modules.transformers.
- modules/: LightningModules used within the transformers in transformer.models.
  - transformers/: Encoder-only, decoder-only and encoder-decoder transformer definitions.
  - attention.py: Masked/unmasked multi-head self attention definition.
  - block.py: Transformer block definition.
  - embedding.py: Positional encoding and input embedding definition.
- params/: Pydantic hyper-parameter classes.
- utils/: Supporting custom layers, functions and constants.