Robot Agent

Fine-tuned Llama2 13B model designed for ReAct-style and Tree-Of-Thoughts style prompting. The codebase has the following desirable features:

  • The entire training procedure runs out of the box on a single computer with 32GB of RAM and 24GB of VRAM (e.g. a consumer-grade graphics card such as the RTX 3090 or RTX 4090), in under 30 hours of compute time.
    • Carefully tuned to use no more than 27GiB of RAM and 23.6GiB of VRAM.
    • This is accomplished through quantization, FP16, TF32, and the usual gradient accumulation/checkpointing settings (see the sketch after this list).
    • Training is fully interruptible/resumable.
  • Heavily commented, short, clean, and reproducible training code.
    • All library dependency versions are fully pinned; base models and datasets are pinned and downloaded as part of the setup process.
    • After initial setup, the training process does not require network access - the entire project folder is portable and can be moved into air-gapped, offline environments.
    • SafeTensors is used everywhere for speed and security.
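
The list above mentions quantization, FP16, TF32, and gradient accumulation/checkpointing; a minimal sketch of how those settings typically fit together with transformers and bitsandbytes is shown below. This is not the repo's actual training script - the model path and numbers are placeholders.

# Hedged sketch (not the repo's actual script): 4-bit quantization, FP16/TF32,
# and gradient accumulation/checkpointing wired up with transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

torch.backends.cuda.matmul.allow_tf32 = True           # allow TF32 matmuls on Ampere+ GPUs

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                                 # QLoRA-style 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",                       # placeholder: in practice, the pinned local base model
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()                  # trade recompute for VRAM

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,                    # keeps the effective batch size up at batch size 1
    fp16=True,
    tf32=True,
    save_steps=200,                                    # frequent checkpoints make the run resumable
)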

Technical details:

  • Based on Llama2 13B.
  • QLoRA training with a rank-128 LoRA, similar to Guanaco.
  • 2048-token context window used in supervised finetuning, 1536-token context window used in direct preference finetuning.
  • Supervised finetuning on a self-instruct dataset generated with Airoboros' self-instruct implementation.
    • The dataset has been filtered for refusals, and so could be considered "uncensored".
    • The dataset generation code also uses a GPT4 jailbreak to reduce the number of refusals in the first place.
  • Direct preference finetuning using Anthropic's hh-rlhf dataset (see the sketch after this list).
    • This replaces the reward modelling and reinforcement learning steps in a standard RLHF pipeline.
  • The codebase takes ideas and inspiration from StackLLaMa, QLoRA, LLaMA-TRL, and Airoboros.
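
For the preference-finetuning step, a rough sketch of a rank-128 LoRA combined with TRL's DPOTrainer on hh-rlhf is shown below. It assumes a 2023-era trl/peft API; the model path and hyperparameters are placeholders, not the repo's actual configuration.

# Hedged sketch (not the repo's actual script): rank-128 LoRA + DPO on hh-rlhf,
# standing in for the reward-model and RL steps of a standard RLHF pipeline.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "meta-llama/Llama-2-13b-hf"                     # placeholder base model path
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

peft_config = LoraConfig(r=128, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")

def split_prompt(example):
    # hh-rlhf stores whole dialogues; DPO wants a shared prompt plus the two
    # competing final assistant turns.
    marker = "\n\nAssistant:"
    cut_chosen = example["chosen"].rfind(marker) + len(marker)
    cut_rejected = example["rejected"].rfind(marker) + len(marker)
    return {
        "prompt": example["chosen"][:cut_chosen],
        "chosen": example["chosen"][cut_chosen:],
        "rejected": example["rejected"][cut_rejected:],
    }

dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(split_prompt)

trainer = DPOTrainer(
    model,
    ref_model=None,                                    # with a PEFT adapter, the frozen base acts as the reference
    args=TrainingArguments(output_dir="./dpo", per_device_train_batch_size=1),
    beta=0.1,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_length=1536,                                   # matches the preference-finetuning context window above
)
trainer.train()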

Roadmap

  • Full reproducible environment with all datasets, base models, and dependencies included.
  • Supervised finetuning script using high-quality publicly available instruct datasets.
  • Human-preference finetuning script based on Anthropic's hh-rlhf "helpfulness" dataset.
  • Accidentally delete the training results on my GPU server and start the training over again from scratch.
  • Fiddle with agentic dataset generation using Charades dataset.
  • If that doesn't work, fiddle with video captioning using multimodal models like Otter to generate agentic captions from how-to videos on YouTube.

Prompt Format

### Human:
INSTRUCTIONS_GO_HERE

### Assistant:

Note that there is a single newline at the end of the prompt. Example:

### Human:
What color is the sky?

### Assistant:
The sky is blue.
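
For programmatic use, a minimal helper that assembles this format could look like the following; the function name is hypothetical, not part of the codebase:

# Hypothetical helper (not part of the repo) that builds the prompt format above,
# ending with exactly one newline after "### Assistant:".
def build_prompt(instructions: str) -> str:
    return f"### Human:\n{instructions}\n\n### Assistant:\n"

print(build_prompt("What color is the sky?"), end="")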

Training

First, download everything that requires an internet connection into the current project folder; the folder will grow to around 30GiB:

make download-datasets-and-models

Next, transfer the current project folder to the training machine, where the rest of the training can be performed fully offline:

make train
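
As an optional sanity check (not something the Makefile is known to do), the Hugging Face libraries can be forced into offline mode before training so that any stray download attempt fails loudly:

# Optional sanity check: put the Hugging Face libraries into offline mode so any
# accidental network access raises an error instead of silently downloading.
import os
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"
# ...import transformers/datasets and launch training after this point...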

Inference

To try out the model, a simple chat-style interface is included; it's not very fancy, but it's good enough for testing:

make chat
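
To drive the model from Python instead of the demo, a generation loop looks roughly like this; the checkpoint path is a placeholder, not the repo's actual output location:

# Hedged sketch of programmatic inference; the path below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./exported-models/robot-agent"                 # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16, device_map="auto")

prompt = "### Human:\nWhat color is the sky?\n\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))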

Using Llama.cpp

First, run the following command to create ./exported-models/ggml-robot-agent-q5_K_M.bin, an 8.6GiB GGML file compatible with Llama.cpp:

make generate-ggml

Now to load the model using Llama.cpp:

make chat-llama-cpp

To use Llama.cpp manually, navigate to your llama.cpp folder and start using the model with the following command (replace PATH_TO_PROJECT_FOLDER with the path to the current project folder):

./main --model PATH_TO_PROJECT_FOLDER/exported-models/ggml-robot-agent-q5_K_M.bin --color --interactive --interactive-first --mirostat 2 --ctx-size 2048 --reverse-prompt $'\n\n### Human:\n' --prompt $'\n\n### Human:\n' --in-suffix $'\n### Assistant:\n'