LLaMa 7b with CUDA acceleration, implemented in Rust. Minimal GPU memory needed!
This repo contains the popular LLaMa 7b language model, fully implemented in the Rust programming language, using dfdx tensors and CUDA acceleration.

It runs LLaMa directly in f16, a format most CPUs cannot accelerate in hardware, so using CUDA is heavily recommended.
*Demo: the 7b model running on an A10 GPU.*
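To give a feel for the building blocks involved, here is a minimal, hypothetical dfdx sketch of the kind of compile-time device selection a `cuda` cargo feature implies, plus a small tensor op. It is illustrative only, not code from this repo, and uses f32 for simplicity even though the model itself runs in f16:

```rust
use dfdx::prelude::*;

fn main() {
    // Pick the device at compile time, mirroring a `cuda` cargo feature.
    #[cfg(feature = "cuda")]
    let dev: Cuda = Default::default();
    #[cfg(not(feature = "cuda"))]
    let dev: Cpu = Default::default();

    // A small random matrix product; the real model runs LLaMa's layers in f16.
    let a: Tensor<Rank2<2, 3>, f32, _> = dev.sample_normal();
    let b: Tensor<Rank2<3, 4>, f32, _> = dev.sample_normal();
    let c = a.matmul(b);
    println!("{:?}", c.array());
}
```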
Install git-lfs, then download the pretrained weights from HuggingFace (grab only the sizes you want):

```bash
sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/decapoda-research/llama-7b-hf
git clone https://huggingface.co/decapoda-research/llama-13b-hf
git clone https://huggingface.co/decapoda-research/llama-65b-hf
```
Next, set up a Python environment for the conversion script:

```bash
python3.x -m venv <my_env_name>    # create a virtual environment; x is your preferred Python version
source <my_env_name>/bin/activate  # activate it (use <my_env_name>\Scripts\activate on Windows)
pip install numpy torch
```
Run `convert.py` to convert the model weights into a format the Rust code can read:

```bash
python convert.py
python convert.py llama-13b-hf
python convert.py llama-65b-hf
```
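If you are curious how converted weights could be consumed from Rust, here is a purely hypothetical sketch that reads a flat little-endian f32 dump into a dfdx tensor. The actual on-disk format produced by `convert.py` is not documented here, so the dtype, layout, and function are all assumptions:

```rust
use std::fs;
use dfdx::prelude::*;

/// Hypothetical loader: reads a flat little-endian f32 matrix dump.
/// The real layout written by convert.py may differ.
fn load_matrix(path: &str, rows: usize, cols: usize) -> Tensor<(usize, usize), f32, Cpu> {
    let dev = Cpu::default();
    let bytes = fs::read(path).expect("failed to read weight file");
    let data: Vec<f32> = bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    assert_eq!(data.len(), rows * cols, "unexpected tensor size");
    dev.tensor_from_vec(data, (rows, cols))
}
```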
You can compile with normal Rust commands.

With CUDA:

```bash
cargo build --release -F cuda
```

Without CUDA:

```bash
cargo build --release
```
With default args:

```bash
./target/release/llama-dfdx --model <model-dir> generate "<prompt>"
./target/release/llama-dfdx --model <model-dir> chat
./target/release/llama-dfdx --model <model-dir> file <path to prompt file>
```
To see what commands/custom args you can use:

```bash
./target/release/llama-dfdx --help
```
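For readers poking around the source, a CLI with this shape maps naturally onto clap's derive API. Here is a rough, hypothetical sketch; the real llama-dfdx argument definitions may differ:

```rust
use clap::{Parser, Subcommand};
use std::path::PathBuf;

// Requires the clap crate with its "derive" feature enabled in Cargo.toml.
#[derive(Parser)]
struct Args {
    /// Directory containing the converted model weights.
    #[arg(long)]
    model: PathBuf,
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Complete a prompt passed on the command line.
    Generate { prompt: String },
    /// Start an interactive chat session.
    Chat,
    /// Read the prompt from a file.
    File { path: PathBuf },
}

fn main() {
    let args = Args::parse();
    match args.command {
        Command::Generate { prompt } => println!("generate: {prompt}"),
        Command::Chat => println!("chat"),
        Command::File { path } => println!("file: {}", path.display()),
    }
}
```

With this layout, clap generates the `--help` output shown above automatically.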