Easy and Efficient Fine-tuning of LLMs: Simple and Efficient Training and Deployment of Large Language Models

Introduction

LLamaTuner is an efficient, flexible, and full-featured toolkit for fine-tuning LLMs (Llama 3, Phi-3, Qwen, Mistral, ...).

Efficient

Support LLM and VLM pre-training and fine-tuning on almost all GPUs. LLamaTuner can fine-tune a 7B LLM on a single 8GB GPU and also supports multi-node fine-tuning of models larger than 70B.
Automatically dispatch high-performance operators such as FlashAttention and Triton kernels to increase training throughput.
Compatible with DeepSpeed 🚀, easily utilizing a variety of ZeRO optimization techniques.

Flexible

Support various LLMs (Llama 3, Mixtral, Llama 2, ChatGLM, Qwen, Baichuan, ...).
Support VLM (LLaVA).
Well-designed data pipeline, accommodating datasets in any format, including but not limited to open-source and custom formats.
Support various training algorithms (QLoRA, LoRA, full-parameter fine-tuning), allowing users to choose the most suitable solution for their requirements.

Full-featured

Support continuous pre-training, instruction fine-tuning, and agent fine-tuning.
Support chatting with large models with pre-defined templates.

Supported Models

Model	Model size	Default module	Template
Baichuan	7B/13B	W_pack	baichuan
Baichuan2	7B/13B	W_pack	baichuan2
BLOOM	560M/1.1B/1.7B/3B/7.1B/176B	query_key_value	-
BLOOMZ	560M/1.1B/1.7B/3B/7.1B/176B	query_key_value	-
ChatGLM3	6B	query_key_value	chatglm3
Command-R	35B/104B	q_proj,v_proj	cohere
DeepSeek (MoE)	7B/16B/67B/236B	q_proj,v_proj	deepseek
Falcon	7B/11B/40B/180B	query_key_value	falcon
Gemma/CodeGemma	2B/7B	q_proj,v_proj	gemma
InternLM2	7B/20B	wqkv	intern2
LLaMA	7B/13B/33B/65B	q_proj,v_proj	-
LLaMA-2	7B/13B/70B	q_proj,v_proj	llama2
LLaMA-3	8B/70B	q_proj,v_proj	llama3
LLaVA-1.5	7B/13B	q_proj,v_proj	vicuna
Mistral/Mixtral	7B/8x7B/8x22B	q_proj,v_proj	mistral
OLMo	1B/7B	q_proj,v_proj	-
PaliGemma	3B	q_proj,v_proj	gemma
Phi-1.5/2	1.3B/2.7B	q_proj,v_proj	-
Phi-3	3.8B	qkv_proj	phi
Qwen	1.8B/7B/14B/72B	c_attn	qwen
Qwen1.5 (Code/MoE)	0.5B/1.8B/4B/7B/14B/32B/72B/110B	q_proj,v_proj	qwen
StarCoder2	3B/7B/15B	q_proj,v_proj	-
XVERSE	7B/13B/65B	q_proj,v_proj	xverse
Yi (1/1.5)	6B/9B/34B	q_proj,v_proj	yi
Yi-VL	6B/34B	q_proj,v_proj	yi_vl
Yuan	2B/51B/102B	q_proj,v_proj	yuan
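
The "Default module" column lists the attention projection layers that adapter-based methods such as LoRA target by default. As a minimal sketch of how such a lookup could be used, the snippet below copies a few rows of the table into a dictionary. The names `DEFAULT_MODULES` and `default_target_modules` are hypothetical and are not part of LLamaTuner.

```python
from typing import List

# Hypothetical lookup built by hand from a few rows of the table above.
# Neither the dictionary nor the function is an LLamaTuner API.
DEFAULT_MODULES = {
    "baichuan2": "W_pack",
    "chatglm3": "query_key_value",
    "llama-3": "q_proj,v_proj",
    "phi-3": "qkv_proj",
    "qwen": "c_attn",
}


def default_target_modules(model_family: str) -> List[str]:
    """Return the default target modules recorded for a model family."""
    key = model_family.strip().lower()
    if key not in DEFAULT_MODULES:
        raise KeyError(f"No default module recorded for {model_family!r}")
    # Table cells hold comma-separated module names.
    return DEFAULT_MODULES[key].split(",")


print(default_target_modules("LLaMA-3"))  # ['q_proj', 'v_proj']
```

The resulting list has the shape that adapter libraries typically expect for a list of target module names, but how LLamaTuner consumes this column internally is not shown here.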

Supported Training Approaches

Approach	Full-tuning	Freeze-tuning	LoRA	QLoRA
Pre-Training	✅	✅	✅	✅
Supervised Fine-Tuning	✅	✅	✅	✅
Reward Modeling	✅	✅	✅	✅
PPO Training	✅	✅	✅	✅
DPO Training	✅	✅	✅	✅
KTO Training	✅	✅	✅	✅
ORPO Training	✅	✅	✅	✅

Supported Datasets

Most of the datasets we currently support are available in the Hugging Face datasets library.

Please refer to data/README.md to learn how to use these datasets. If you want to explore more datasets, please refer to the awesome-instruction-datasets. Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.

pip install --upgrade huggingface_hub
huggingface-cli login

Data Preprocessing

We provide a number of data preprocessing tools in the data folder. These tools are intended to be a starting point for further research and development.

data_utils.py : Data preprocessing and formatting
sft_dataset.py : Supervised fine-tuning dataset class and collator
conv_dataset.py : Conversation dataset class and collator

Model Zoo

We provide a number of models in the Hugging Face model hub. These models are trained with QLoRA and can be used for inference and finetuning. We provide the following models:

Base Model	Adapter	Instruct Datasets	Train Script	Log	Model on Hugging Face
llama-7b	FullFinetune	-	-	-	-
llama-7b	QLoRA	openassistant-guanaco	finetune_lamma7b	wandb log	GaussianTech/llama-7b-sft
llama-7b	QLoRA	OL-CC	finetune_lamma7b	-	-
baichuan7b	QLoRA	openassistant-guanaco	finetune_baichuan7b	wandb log	GaussianTech/baichuan-7b-sft
baichuan7b	QLoRA	OL-CC	finetune_baichuan7b	wandb log	-

Requirements

Mandatory	Minimum	Recommended
python	3.8	3.10
torch	1.13.1	2.2.0
transformers	4.37.2	4.41.0
datasets	2.14.3	2.19.1
accelerate	0.27.2	0.30.1
peft	0.9.0	0.11.1
trl	0.8.2	0.8.6

Optional	Minimum	Recommended
CUDA	11.6	12.2
deepspeed	0.10.0	0.14.0
bitsandbytes	0.39.0	0.43.1
vllm	0.4.0	0.4.2
flash-attn	2.3.0	2.5.8
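
To compare an existing environment against the minimum versions in the "Mandatory" table, a small standard-library check can help. This is a sketch under stated assumptions: the helper names (`parse_version`, `meets_minimum`, `check_requirements`) are hypothetical, and the version parsing is deliberately simple (numeric components only), not a full implementation of Python's version specification.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions copied from the "Mandatory" table above.
MINIMUM_VERSIONS = {
    "torch": "1.13.1",
    "transformers": "4.37.2",
    "datasets": "2.14.3",
    "accelerate": "0.27.2",
    "peft": "0.9.0",
    "trl": "0.8.2",
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.2.0+cu121' into (2, 2, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums: Dict[str, str]) -> Dict[str, Optional[bool]]:
    """Map each package to True/False (meets minimum) or None (not installed)."""
    report: Dict[str, Optional[bool]] = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        report[name] = meets_minimum(installed, minimum)
    return report


if __name__ == "__main__":
    labels = {True: "ok", False: "too old", None: "not installed"}
    for name, ok in check_requirements(MINIMUM_VERSIONS).items():
        print(f"{name}: {labels[ok]}")
```

The optional packages (deepspeed, bitsandbytes, vllm, flash-attn) can be checked the same way by passing a second dictionary.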

Hardware Requirements

* estimated

Method	Bits	7B	13B	30B	70B	110B	8x7B	8x22B
Full	AMP	120GB	240GB	600GB	1200GB	2000GB	900GB	2400GB
Full	16	60GB	120GB	300GB	600GB	900GB	400GB	1200GB
Freeze	16	20GB	40GB	80GB	200GB	360GB	160GB	400GB
LoRA/GaLore/BAdam	16	16GB	32GB	64GB	160GB	240GB	120GB	320GB
QLoRA	8	10GB	20GB	40GB	80GB	140GB	60GB	160GB
QLoRA	4	6GB	12GB	24GB	48GB	72GB	30GB	96GB
QLoRA	2	4GB	8GB	16GB	24GB	48GB	18GB	48GB
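
One way to read the table is to ask which model size fits a given memory budget. The sketch below copies a subset of the estimated figures into a dictionary and returns the largest dense model size whose estimate fits. The function name `largest_model_that_fits` is hypothetical, and the numbers are the estimates from the table, not measured values.

```python
from typing import Optional

# Estimated GPU memory in GB, copied from a subset of the table above.
MODEL_SIZES = ["7B", "13B", "30B", "70B", "110B"]
MEMORY_GB = {
    ("Full", "AMP"): [120, 240, 600, 1200, 2000],
    ("Full", "16"): [60, 120, 300, 600, 900],
    ("LoRA", "16"): [16, 32, 64, 160, 240],
    ("QLoRA", "8"): [10, 20, 40, 80, 140],
    ("QLoRA", "4"): [6, 12, 24, 48, 72],
}


def largest_model_that_fits(method: str, bits: str,
                            available_gb: float) -> Optional[str]:
    """Return the largest model size whose estimate fits, or None."""
    best = None
    for size, needed in zip(MODEL_SIZES, MEMORY_GB[(method, bits)]):
        if needed <= available_gb:
            best = size
    return best


print(largest_model_that_fits("QLoRA", "4", 24))  # 30B
print(largest_model_that_fits("LoRA", "16", 24))  # 7B
print(largest_model_that_fits("Full", "16", 24))  # None
```

Because the table values are estimates, it is sensible to leave some headroom rather than treat an exact match as a guarantee.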

Getting Started

Clone the code

Clone this repository and navigate to the LLamaTuner folder:

git clone https://github.com/jianzhnie/LLamaTuner.git
cd LLamaTuner

Getting Started

Main function	Usage	Scripts
train.py	Full fine-tuning of LLMs on SFT datasets	full_finetune
train_lora.py	Fine-tune LLMs with LoRA (Low-Rank Adaptation of Large Language Models)	lora_finetune
train_qlora.py	Fine-tune LLMs with QLoRA (QLoRA: Efficient Finetuning of Quantized LLMs)	qlora_finetune

QLoRA int4 Fine-tuning

The train_qlora.py code is a starting point for finetuning and inference on various datasets. Basic command for finetuning a baseline model on the Alpaca dataset:

python train_qlora.py --model_name_or_path <path_or_name>

For models larger than 13B, we recommend adjusting the learning rate:

python train_qlora.py --learning_rate 0.0001 --model_name_or_path <path_or_name>
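
When launching runs from Python (for example, to sweep over models), the same command can be assembled as an argument list. This is a minimal sketch that uses only the two flags shown above. The helper `build_qlora_command` is hypothetical, and `path/to/model` is a placeholder rather than a real checkpoint.

```python
import shlex
from typing import List, Optional


def build_qlora_command(model_name_or_path: str,
                        learning_rate: Optional[float] = None) -> List[str]:
    """Assemble the train_qlora.py command line as a list of arguments."""
    command = ["python", "train_qlora.py",
               "--model_name_or_path", model_name_or_path]
    # Only pass a learning rate when the caller overrides the default.
    if learning_rate is not None:
        command += ["--learning_rate", str(learning_rate)]
    return command


cmd = build_qlora_command("path/to/model", learning_rate=0.0001)
print(shlex.join(cmd))
# python train_qlora.py --model_name_or_path path/to/model --learning_rate 0.0001
```

The list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building it as a list avoids shell quoting problems with paths that contain spaces.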

To find more scripts for finetuning and inference, please refer to the scripts folder.

Known Issues and Limitations

Here is a list of known issues and bugs. If your issue is not reported here, please open a new issue and describe the problem.

4-bit inference is slow. Currently, our 4-bit inference implementation is not yet integrated with the 4-bit matrix multiplication.
Resuming a LoRA training run with the Trainer currently results in an error.
Currently, using bnb_4bit_compute_type='fp16' can lead to instabilities. For 7B LLaMA, only 80% of finetuning runs complete without error. We have solutions, but they are not integrated yet into bitsandbytes.
Make sure that tokenizer.bos_token_id = 1 to avoid generation issues.
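
The last item can be turned into an early check before generation. The sketch below assumes only that the tokenizer object exposes a `bos_token_id` attribute, as Hugging Face tokenizers do. The helper `check_bos_token_id` is hypothetical, and a `SimpleNamespace` stands in for a real tokenizer so the example runs without downloading a model.

```python
from types import SimpleNamespace

EXPECTED_BOS_TOKEN_ID = 1  # value stated in the note above


def check_bos_token_id(tokenizer) -> None:
    """Raise ValueError if the tokenizer's BOS id differs from the expected one."""
    actual = getattr(tokenizer, "bos_token_id", None)
    if actual != EXPECTED_BOS_TOKEN_ID:
        raise ValueError(
            f"tokenizer.bos_token_id is {actual!r}, "
            f"expected {EXPECTED_BOS_TOKEN_ID}"
        )


# Stand-in object; in practice this would be a loaded tokenizer.
check_bos_token_id(SimpleNamespace(bos_token_id=1))
print("bos_token_id check passed")
```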

License

LLamaTuner is released under the Apache 2.0 license.

Acknowledgements

We thank the Hugging Face team, in particular Younes Belkada, for their support in integrating QLoRA with the PEFT and transformers libraries.

We appreciate the work by many open-source contributors, especially:

Some LLM fine-tuning repos

Citation

Please cite the repo if you use the data or code in this repo.

@misc{Chinese-Guanaco,
  author = {jianzhnie},
  title = {LLamaTuner: Easy and Efficient Fine-tuning LLMs},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/jianzhnie/LLamaTuner}},
}