This is the official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Magpie generates high-quality alignment data by prompting aligned LLMs with their pre-query templates. Unlike many existing synthetic data generation methods, Magpie doesn't rely on prompt engineering or seed questions for generating synthetic data. Instead, it uses the prompt template of an aligned LLM to generate both the user query and an LLM response.
Currently, Magpie has been tested on the Llama-3, Qwen2, Phi 3 and Gemma-2 series. Please submit an issue for more model support.
Model Family | Magpie | Magpie Scripts | Datasets | Size | Note |
---|---|---|---|---|---|
Llama 3.1 | โญ๏ธ | 8B,70B | 70B,405B(Argilla) | 1M | Apply a logits processor to surpress markdown format. |
Llama 3 | โ | 8B,70B | 8B,70B | 3M + 1M | |
Qwen2 | โ | 7B,72B,Math 7B | 7B,72B | 3M + 1M | |
Phi 3 | โ | mini,small,medium | medium | 1M | |
Gemma-2 | โญ๏ธ | 9B,27B | 27B | 534K | Apply a filter before generating responses. |
Qwen2.5 | โญ๏ธ | 3B,7B,14B,32B,72B | |||
Gemma-1.1 | โญ๏ธ | 7B | |||
Llama 2 | โญ๏ธ | 7B,70B | |||
Vicuna | โญ๏ธ | 7B | |||
Mistral | โญ๏ธ | 7B | |||
Yi | โญ๏ธ | 34B | |||
DeepSeek Coder | โญ๏ธ | Coder V2 Lite |
The navigation of all available Magpie datasets can be found here.
We hope Magpie can contribute to the democratization of AI with enhanced transparency of model alignment processes!
Build environment
git clone https://github.com/magpie-align/magpie.git
cd magpie
conda create -n magpie python=3.10 -y
conda activate magpie
pip install -r requirements.txt
Get access to Llama-3 models from ๐ค Huggingface
You can apply for Llama-3 model access here. To login in the terminal, enter:
huggingface-cli login
then enter your Huggingface private key beginning with "hf_".
Play with Jupyter Notebook
The toy example can be found in demo.ipynb
. Have fun!
We use Llama-3-8B-Instruct as an example to demonstrate the batched data generation process. To run batched generation, you can simply run:
cd scripts
bash magpie.sh
The script will generate both instructions and responses in the data folder. It has been tested on an RTX 4090 24G GPU. If you are using GPUs with less memory, consider implementing quantization.
We also provide scripts for other models in the scripts
folder. You can use this navigation to find specific Magpie scripts. Note that for model sizes greater than 8B, you may need 4*A100 GPUs to run the scripts.
After generating instruction-response pairs, you can extend them to multi-turn conversations. To do so, simply run the following command:
bash magpie-multi-turn.sh ***_ins_res.json
where ***_ins_res.json
is the single-turn instruction-response pairs generated in the previous step.
To tag the generated instruction-response pairs, you can run:
cd scripts
bash unitag.sh ***_ins_res.json all
This script will automatically generate quality, difficulty, task category, safety, reward, and language for the generated dataset. You can also generate one tag at a time. For example, if you just want to generate the safety label using device 0, you can run:
cd scripts
bash unitag.sh ***_ins_res.json safety 0
You may generate datasets with different generation configurations. We provide a Jupyter notebook here for concatenating all datasets and converting them to ShareGPT format, which is fully supported by Axolotl for fine-tuning.
Once you have a full dataset converted to ShareGPT format, you can calculate the minimum neighbor distance of each instruction and remove repetitions. To do so, run:
cd exp
python gen_dis.py --input_file ***_sharegpt.jsonl
where ***_sharegpt.jsonl
is the dataset path obtained in the previous step. The Python script will take care of building the FAISS index and calculating the minimum distance.
We provide a Jupyter notebook here for simple filtering. You can adjust the filtering parameters to design and apply your own filter based on your needs.
Please take a look at the recipes directory for instructions and our Magpie model recipes.
If you find the model, data, or code useful, please cite our paper ๐คฉ:
@article{xu2024magpie,
title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
journal={ArXiv},
year={2024},
volume={abs/2406.08464},
url={https://api.semanticscholar.org/CorpusID:270391432}
}