A repository for pre-training a small-parameter Chinese LLaMA2 from scratch and then instruction-tuning it with SFT; a single 24 GB GPU is enough to end up with a chat-llama2 capable of simple Chinese Q&A.
MIT License
Created by Limzero & Ambrose & Guolin
This project aims to build a small-parameter Chinese Llama2 repository from scratch.
It covers the full pipeline: pre-training, SFT instruction fine-tuning, and (planned) reward modeling plus reinforcement learning.
Beyond the code, the project is also assembling a set of LLM learning materials (work in progress).
We hope it helps LLM beginners get started as quickly as possible!
# 1. Download the preprocessed pre-training corpus "Baby-llama2-chinese Corpus" (63.4B tokens, ~118 GB)
# 2. Put the downloaded data under ./data/
# 3. Edit data_path_list in data_process.py to point at the corpus files you downloaded
# 4. Run data_process.py; it writes the tokenized corpus to ./data/pretrain_data.bin
python data_process.py
# 5. Adjust the model and training hyperparameters in pretrain.py (max_seq_len, dim, n_layers, n_heads, batch_size, ...; a config sketch follows the quick-start steps)
# 6. Launch pre-training with the commands below (this configuration assumes 4×3090 GPUs)
screen -S ambrose    # create a new screen session named ambrose
screen -r ambrose    # reattach to the ambrose screen session
torchrun --standalone --nproc_per_node=4 pretrain.py
# 7. The pre-trained checkpoint is saved under out/pretrain
# 8. Download the alpaca-zh and bell SFT datasets, then process them with sft_data_process.py
python sft_data_process.py
# 9. This produces sft_data.csv under ./sft_data
# 10. Run SFT fine-tuning
python sft.py
# 11. The SFT checkpoint is saved under out/sft
# 12. Chat with the SFT model via eval.py
python eval.py
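The hyperparameters named in step 5 control the model size and memory footprint. Below is a minimal illustrative sketch of such a configuration; the field names mirror step 5, but the values are placeholders, not the repo's defaults (check pretrain.py for those).

```python
# Illustrative only: placeholder values, not the project's real defaults.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    max_seq_len: int = 512   # context length of each training chunk
    dim: int = 512           # hidden size (the 92M vs 218M models differ mainly here)
    n_layers: int = 8        # number of transformer blocks
    n_heads: int = 8         # attention heads per block
    batch_size: int = 32     # per-GPU batch size; lower it if you hit OOM
```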
**Tokenizer**: there are two ways to get a tokenizer for an LLM: build a vocabulary and train a custom tokenizer yourself, or reuse the tokenizer of an open-source model such as ChatGLM2-6B or Llama2.
Because Llama2's official vocabulary contains only about 700 Chinese characters (so Chinese text is handled poorly and inefficiently), this project adopts the ChatGLM2-6B tokenizer instead. Its vocabulary size is 64,793, which conveniently is below 65,535 (the largest value a uint16 can hold), so token ids can be stored as uint16 rather than int32, halving the storage footprint.
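A quick sketch of why the vocabulary size matters for storage (the file name and token ids below are made up for the demo):

```python
# Token ids below 65536 fit in np.uint16: 2 bytes each instead of the
# 4 bytes an int32 would take, so the on-disk corpus is half the size.
import numpy as np

vocab_size = 64793                              # ChatGLM2-6B tokenizer
assert vocab_size <= np.iinfo(np.uint16).max    # 65535

ids = np.array([64790, 30910, 2], dtype=np.uint16)  # arbitrary demo ids
ids.tofile("demo.bin")                          # raw binary, 2 bytes per token
assert (np.fromfile("demo.bin", dtype=np.uint16) == ids).all()
```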
**Corpus for pre-training**: the Chinese corpora used for pre-training are listed below.

| Corpus | Description |
|---|---|
| Wiki中文百科: wikipedia-cn-20230720-filtered | Chinese Wikipedia data |
| BaiduBaiKe (extraction code: bwvb) | Chinese Baidu Baike encyclopedia data |
| C4_zh (part1 code: zv4r, part2 code: sb83, part3 code: l89d) | C4 is one of the largest available language datasets, collecting more than 156B tokens from over 365M internet domains; C4_zh is its Chinese portion |
| WuDaoCorpora: BAAI WuDaoCorpora Text | 200 GB of open data from the Chinese WuDao corpus |
| shibing624/medical | Medical-domain pre-training data from shibing624 |
To save everyone the preprocessing time, this project also releases the pre-training corpus already tokenized with the ChatGLM2-6B tokenizer, 63.4B tokens in total: Baby-llama2-chinese Corpus (extraction code: 6unr). Put the downloaded data under ./data.
With the author's limited compute (4×3090 GPUs), only these 63.4B tokens were used; to pre-train larger models (300M+ parameters) or to scale out, consider frameworks such as DeepSpeed or Megatron.
# Data cleaning: near-duplicate removal with MinHash and SimHash
cd data_clean
python clear.py
# For the baidubaike data this produces baike.parquet, all_no_dulpticates.parquet and all_no_dulpticates_simhash.parquet
(Timings below were measured on an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz.)
The raw baidubaike dump contains 5,634,898 entries.
| Step | Method | Rows (before → after) | Time |
|---|---|---|---|
| process_baike() | regex + rule-based cleaning | 5,634,898 → 3,605,212 | 552.046 s |
| remove_dataset_duplicate_rows() | MinHash deduplication | 3,605,212 → 2,736,033 | ~4 h |
| remove_dataset_duplicate_rows_simhash() | SimHash deduplication | 3,605,212 → 3,548,779 | ~23 min |
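For reference, here is a minimal sketch of MinHash near-duplicate filtering with the datasketch library. It illustrates the technique, not the exact logic of clear.py; the shingle size and similarity threshold are assumptions.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, k: int = 3, num_perm: int = 128) -> MinHash:
    """Hash the set of character k-grams of a document."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - k + 1, 1)):
        m.update(text[i:i + k].encode("utf-8"))
    return m

def dedup(docs, threshold: float = 0.8):
    """Keep only documents with no earlier near-duplicate (est. Jaccard >= threshold)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, doc in enumerate(docs):
        m = minhash_of(doc)
        if not lsh.query(m):          # no similar document indexed yet
            lsh.insert(str(idx), m)
            kept.append(doc)
    return kept

print(dedup(["百度百科是一部开放的百科全书", "百度百科是一部开放的百科全书!", "今天天气不错"]))
```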
**Pre-training data preprocessing**: following the usual GPT recipe, the corpus is tokenized ahead of time, and an end-of-sequence token <eos> is appended after each sample to separate it from the next.
All tokenized samples are then concatenated into one array (np.uint16) and stored on disk as a .bin binary file. If the corpus is too large for memory, read it with mmap.
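A sketch of that preprocessing under assumed names (the project's real code is data_process.py; the <eos> id and the helper signature here are placeholders):

```python
import numpy as np

EOS_ID = 2  # placeholder; use the tokenizer's real eos id in practice

def binarize(samples, tokenizer, out_path="./data/pretrain_data.bin"):
    ids = []
    for text in samples:
        ids.extend(tokenizer.encode(text))
        ids.append(EOS_ID)                   # <eos> separates adjacent samples
    np.array(ids, dtype=np.uint16).tofile(out_path)  # safe: vocab < 65536

# At training time, memory-map the file instead of loading it all:
# data = np.memmap("./data/pretrain_data.bin", dtype=np.uint16, mode="r")
```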
# Tokenize and binarize the pre-training corpus
python data_process.py
# this produces pretrain_data.bin under ./data
# Launch pre-training
screen -S ambrose    # create a new screen session named ambrose
screen -r ambrose    # reattach to the ambrose screen session
# inside the screen session, start training; nproc_per_node is the number of GPUs (4 here)
torchrun --standalone --nproc_per_node=4 pretrain.py
# the pre-trained model is saved under out/pretrain
**SFT instruction fine-tuning**
Common ways to adapt an LLM include Full Fine-tuning (updating all parameters), Parameter-Efficient Fine-tuning, Prompt Engineering (crafting prompt templates), and Retrieval-Augmented Generation (RAG).
Since the largest model in this project is only about 218M parameters (smaller than bert-large's 340M), Full Fine-tuning is used.
**SFT data**
General (daily Q&A) SFT data:

| SFT dataset | Description |
|---|---|
| alpaca-zh | A portion of SFT data from shibing624: self-instruct data generated with GPT-4 following the Alpaca recipe, about 50k samples |
| bell | A portion of SFT data from BelleGroup: about 1M Chinese instruction samples generated by the BELLE project |
Medical-domain SFT data:

| SFT dataset | Description |
|---|---|
| shibing624/medical | From shibing624; besides the pre-training corpus mentioned above, it also contains a portion of SFT data |
| HuatuoGPT-sft-data-v1 | SFT data from HuatuoGPT |
| DISC-Med-SFT | DISC-Med-SFT Dataset |
| ChatMed_Consult-v0.3 (michaelwzhu/ChatMed_Consult-v0.3) | Part of the ChatMed-Dataset; the queries (prompts) come from real-world online medical consultations (549,326 samples), reflecting the needs of different users/patients, and the responses are currently generated by OpenAI GPT-3.5 |
**SFT sample construction**: SFT corpora are usually small, so there is no need to pre-tokenize; tokenization happens in the Dataloader while assembling each batch. See dataset_sft.py; the key points are as follows (a sketch follows this list):
- the prompt and the answer must be separated by a <bos> token;
- the answer must be terminated by an <eos> token;
- when computing the loss, the prompt part is masked so that only the answer tokens contribute.
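A minimal sketch of that layout, with assumed special-token ids (the project's real logic lives in dataset_sft.py):

```python
BOS_ID, EOS_ID, IGNORE = 1, 2, -100   # placeholder ids; -100 marks masked positions

def build_example(prompt_ids, answer_ids):
    # layout: prompt ... <bos> answer ... <eos>
    input_ids = prompt_ids + [BOS_ID] + answer_ids + [EOS_ID]
    labels = list(input_ids)
    # mask the prompt and the <bos>: only answer + <eos> tokens carry loss
    labels[: len(prompt_ids) + 1] = [IGNORE] * (len(prompt_ids) + 1)
    return input_ids, labels

ids, labels = build_example([11, 12, 13], [21, 22])
assert labels == [IGNORE, IGNORE, IGNORE, IGNORE, 21, 22, EOS_ID]
```

Here labels are aligned with input_ids; the one-position shift for next-token prediction is assumed to happen inside the loss computation, as in HF-style causal LMs.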
# Download the alpaca-zh and bell SFT datasets and build the SFT training file
python sft_data_process.py
# this produces sft_data.csv under ./sft_data
# Launch SFT fine-tuning
screen -S ambrose    # create a new screen session named ambrose
screen -r ambrose    # reattach to the ambrose screen session
# inside the screen session, run:
python sft.py
# the SFT model is saved under out/sft
**Released models and training loss curves** (the loss plots are in the repository):
- v1 (8.278B tokens: Wiki中文百科 + BaiduBaiKe + shibing624/medical): Llama2-Chinese-92M-v1 vs Llama2-Chinese-92M-v1-smallvocab vs Llama2-Chinese-218M-v1
- v2 (14B tokens: Wiki中文百科 + BaiduBaiKe + shibing624/medical + C4_zh): Llama2-Chinese-92M-v2 vs Llama2-Chinese-218M-v2
- v3 (63.4B tokens: Wiki中文百科 + BaiduBaiKe + shibing624/medical + C4_zh + WuDaoCorpora): Llama2-Chinese-218M-v3
# Evaluate the pre-trained (base) models with eval_pretrain.py
python eval_pretrain.py
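For intuition, here is a generic top-k sampling loop of the kind such an evaluation script runs. This is a sketch, not eval_pretrain.py itself; it assumes a causal LM that maps ids of shape (1, T) to logits of shape (1, T, vocab), and the sampling parameters are arbitrary defaults.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, ids, max_new_tokens=100, temperature=1.0, top_k=30, eos_id=2):
    """Append up to max_new_tokens sampled tokens to ids (shape (1, T))."""
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature     # logits at the last position
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = -float("inf")     # keep only the top-k candidates
        next_id = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:                    # stop at <eos>
            break
    return ids
```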
#Input (sample prompt 1): …
Llama2-Chinese-92M-v1 response: …
Llama2-Chinese-92M-v2 response: …
Llama2-Chinese-218M-v1 response: …
Llama2-Chinese-218M-v2 response: …
Llama2-Chinese-218M-v3 response: …
#Input (sample prompt 2): …
Llama2-Chinese-92M-v1 response: …
Llama2-Chinese-92M-v2 response: …
Llama2-Chinese-218M-v1 response: …
Llama2-Chinese-218M-v2 response: …
Llama2-Chinese-218M-v3 response: …
(The full Chinese prompts and completions are available in the repository.)
# Chat with the SFT models via eval.py
python eval.py
#Input (sample question 1): …
Llama2-Chinese-92M-v1-NormalChat response: …
Llama2-Chinese-92M-v1-MedicalChat response: …
Llama2-Chinese-92M-v2-NormalChat response: …
Llama2-Chinese-92M-v2-MedicalChat response: …
Llama2-Chinese-218M-v1-NormalChat response: …
Llama2-Chinese-218M-v1-MedicalChat response: …
Llama2-Chinese-218M-v2-NormalChat response: …
Llama2-Chinese-218M-v2-MedicalChat response: …
Llama2-Chinese-218M-v3-NormalChat response: …
Llama2-Chinese-218M-v3-MedicalChat response: …
#Input (sample question 2): … (responses from the same ten models: …)
#Input (sample question 3): … (responses from the same ten models: …)
(The full Chinese questions and answers are available in the repository.)
The *-MedicalChat models above were fine-tuned with the medical SFT data; the *-NormalChat models used only the daily Q&A SFT data.
If you are interested in LLMs too, feel free to join the QQ group: 716455397.