baby-llama2-chinese

A repository for pretraining a small-parameter Chinese LLaMa2 from scratch and then running SFT; a single 24G GPU is enough to obtain a chat-llama2 with basic Chinese Q&A ability.

MIT License


Baby-Llama2-Chinese

Created by Limzero & Ambrose & Guolin

This project aims to build a Chinese Llama2 with a small number of parameters, trained entirely from scratch.

It covers the full pipeline of pretraining and SFT instruction fine-tuning; **reward modeling and reinforcement learning are planned**.

Beyond the code, the project is also putting together a set of LLM learning materials (work in progress).

We hope this open-source project helps LLM beginners get started as quickly as possible!

Project goals:

  • Collect and consolidate Chinese pretraining corpora, pretrain a 500M-1B parameter Llama2-Chinese base model, and run SFT instruction fine-tuning on it to obtain a chat model with basic Chinese Q&A ability.
  • Provide an LLM code base covering the complete pretraining + SFT pipeline in plain PyTorch, without distributed-training frameworks such as DeepSpeed or Megatron, so that it stays easy to read and run.
  • Share LLM learning materials and the pitfalls encountered during training.

Quick Start

# 1. Download the Baby-llama2-chinese Corpus (63.4B tokens, ~118G) from the links below
# 2. Place the data under ./data/
# 3. Edit data_path_list in data_process.py to point to the downloaded files
# 4. Run data_process.py; it writes ./data/pretrain_data.bin
python data_process.py
# 5. Adjust the model and training parameters in pretrain.py (max_seq_len, dim, n_layers, n_heads, batch_size) to match your hardware
# 6. Run pretrain.py (the authors used 4*3090)
screen -S ambrose    # create a screen session named "ambrose"
screen -r ambrose    # re-attach to the "ambrose" screen session
torchrun --standalone --nproc_per_node=4 pretrain.py
# 7. The pretrained model is saved under out/pretrain
# 8. Download the alpaca-zh and bell SFT data, then run sft_data_process.py to build the SFT dataset
python sft_data_process.py
# 9. This writes sft_data.csv under ./sft_data
# 10. Run SFT
python sft.py
# 11. The SFT model is saved under out/sft

# 12. Test the SFT model with eval.py
python eval.py

  • 2024.01.24: Released Llama2-Chinese-92M-v1-smallvocab, Llama2-Chinese-218M-v1 and Llama2-Chinese-92M-v1, pretrained on roughly 8.4B tokens.
  • 2024.02.29: Released Llama2-Chinese-218M-v3, pretrained on 63.4B tokens, together with Llama2-Chinese-218M-v3-MedicalChat, obtained by SFT fine-tuning it on medical data.
  • 2024.05.21: Added Minhash and Simhash data-cleaning code (clean_data) for the budubaike corpus.


  1. Tokenizer: there are two ways to get a tokenizer for an LLM: train a custom tokenizer on your own corpus, or reuse the tokenizer of an existing open-source model such as ChatGLM2-6B or Llama2.

    Since the official Llama vocabulary contains only about 700 Chinese characters, Chinese text is tokenized very inefficiently. This project therefore uses the ChatGLM2-6B tokenizer, whose vocabulary size of 64793 happens to fit into uint16 (0~65535), so every token id can be stored in two bytes instead of the usual int32, roughly halving the storage needed for a large corpus.
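
    For illustration, here is a minimal sketch of encoding text with the ChatGLM2-6B tokenizer and storing the token ids as uint16; it assumes the Hugging Face transformers AutoTokenizer interface and is not the repository's exact data_process.py.

    ```python
    # Minimal sketch: tokenize with the ChatGLM2-6B tokenizer and store ids as uint16.
    # Assumes the Hugging Face `transformers` AutoTokenizer interface; illustrative only.
    import numpy as np
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

    text = "白日依山尽，黄河入海流。"
    ids = tokenizer.encode(text, add_special_tokens=False)

    assert max(ids) < 2 ** 16             # vocab size 64793 fits into uint16
    arr = np.array(ids, dtype=np.uint16)  # two bytes per token instead of four (int32)
    arr.tofile("sample_tokens.bin")
    ```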

  2. **Corpus for pre-training**: since the LLM boom, more and more open-source Chinese pretraining corpora have become available. This project collects and processes the following classic datasets:

    | Corpus | Description |
    |---|---|
    | Wiki | wikipedia-cn-20230720-filtered: a filtered dump of Chinese Wikipedia. |
    | BaiduBaiKe | Baidu Netdisk download (extraction code: bwvb): Chinese Baidu Baike encyclopedia data. |
    | C4_zh | part1 (code: zv4r), part2 (code: sb83), part3 (code: l89d). C4 is one of the largest publicly available language datasets, with more than 156 billion tokens collected from over 365 million web domains; C4_zh is its Chinese portion. |
    | WuDaoCorpora | The open WuDaoCorpora Text pretraining dataset from the Beijing Academy of Artificial Intelligence (BAAI), about 200G. |
    | shibing624/medical | Chinese medical-domain pretraining data from shibing624/medical. |

    All of the above corpora are tokenized with the ChatGLM2-6B tokenizer and packaged as the 63.4B-token Baby-llama2-chinese Corpus (Baidu Netdisk extraction code: 6unr). Download it and place it under ./data.

    Note: with the authors' hardware (4*3090), pretraining a ~300M-parameter model on the full 63.4B-token corpus already takes a long time; the training code is plain PyTorch DDP and does not include distributed-training frameworks such as DeepSpeed or Megatron.


  1. Data cleaning (deduplicating the pretraining data):

    # Deduplicate the data with Minhash / Simhash
    cd data_clean
    python clear.py
    # For the budubaike data: baike.parquet -> all_no_dulpticates.parquet -> all_no_dulpticates_simhash.parquet
    

    (Test machine: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz)

    The raw budubaike corpus contains 5,634,898 entries.

    Time consumed:

    | Step | Rows (before -> after) | Time |
    |---|---|---|
    | process_baike() (merge the raw data) | 5,634,898 -> 3,605,212 | 552.046 s |
    | remove_dataset_duplicate_rows() (Minhash dedup) | 3,605,212 -> 2,736,033 | ~4 h |
    | remove_dataset_duplicate_rows_simhash() (Simhash dedup) | 3,605,212 -> 3,548,779 | ~23 min |

    • Intermediate and final results are stored in parquet format.
    • Either Minhash or Simhash deduplication can be used: Minhash removes more near-duplicates but is far slower, while Simhash is much faster but less thorough (see the sketch below).
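
    As a reference for the deduplication step, here is a minimal Minhash sketch built on the datasketch package; the threshold, num_perm and shingling choices are illustrative and may differ from the repository's clear.py.

    ```python
    # Illustrative Minhash deduplication using `datasketch` (not the repository's clear.py).
    from datasketch import MinHash, MinHashLSH

    def minhash_of(text: str, num_perm: int = 128) -> MinHash:
        m = MinHash(num_perm=num_perm)
        for ch in set(text):                 # character shingles; word n-grams also work
            m.update(ch.encode("utf-8"))
        return m

    def dedup(texts, threshold: float = 0.8):
        lsh = MinHashLSH(threshold=threshold, num_perm=128)
        kept = []
        for i, text in enumerate(texts):
            m = minhash_of(text)
            if lsh.query(m):                 # a near-duplicate has already been kept
                continue
            lsh.insert(str(i), m)
            kept.append(text)
        return kept

    print(dedup(["百度百科是一部内容开放的网络百科全书",
                 "百度百科是一部内容开放、自由的网络百科全书",
                 "一条完全不同的文本"]))
    ```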
  2. **Data preprocessing**: following the usual GPT recipe, every document is tokenized in advance and terminated with an <eos> token to separate it from the next one; all token ids are then concatenated into a single np.uint16 array and stored on disk as a .bin file. During training the file is read with mmap, so the whole corpus never has to fit into RAM.

    # Data preprocessing
    python data_process.py
    # This writes pretrain_data.bin under ./data
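
    For clarity, here is a minimal sketch of how such a packed .bin file can be consumed with memory mapping during training; the class name and slicing scheme are illustrative and may differ from the repository's dataset code.

    ```python
    # Minimal sketch: read the packed uint16 token file with np.memmap for next-token prediction.
    # Illustrative only; the repository's dataset implementation may differ in detail.
    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class PretrainBinDataset(Dataset):
        def __init__(self, bin_path: str, max_seq_len: int = 512):
            self.data = np.memmap(bin_path, dtype=np.uint16, mode="r")  # never fully loaded into RAM
            self.max_seq_len = max_seq_len
            self.n_samples = len(self.data) // (max_seq_len + 1)

        def __len__(self):
            return self.n_samples

        def __getitem__(self, idx):
            start = idx * (self.max_seq_len + 1)
            chunk = np.asarray(self.data[start:start + self.max_seq_len + 1], dtype=np.int64)
            return torch.from_numpy(chunk[:-1]), torch.from_numpy(chunk[1:])  # inputs, targets

    # ds = PretrainBinDataset("./data/pretrain_data.bin", max_seq_len=512)
    ```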
    

# Pretraining
screen -S ambrose    # create a screen session named "ambrose"
screen -r ambrose    # re-attach to the "ambrose" screen session
# Run inside the screen session; set nproc_per_node to the number of GPUs (4 here)
torchrun --standalone --nproc_per_node=4 pretrain.py
# The pretrained model is saved under out/pretrain
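
torchrun launches one process per GPU; the sketch below shows the kind of DDP initialization a multi-GPU pretrain.py typically performs. It is illustrative and not the repository's exact code.

```python
# Illustrative DDP setup for a torchrun launch (not the repository's exact pretrain.py).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp() -> int:
    dist.init_process_group(backend="nccl")   # torchrun provides RANK / WORLD_SIZE / LOCAL_RANK
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

# local_rank = setup_ddp()
# model = DDP(model.cuda(local_rank), device_ids=[local_rank])
# ...each process then trains on its own data shard (e.g. via DistributedSampler)...
```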

SFT (Supervised Fine-Tuning)

Fine-tuning continues training a pretrained LLM on a smaller, task-specific dataset, turning the general-purpose base model into a specialized one that better matches the target application and human expectations.


  1. Common ways to adapt an LLM to downstream tasks:

    • Full Fine-tuning: update all parameters of the model on the downstream data.
    • Parameter-Efficient Fine-tuning: update only a small number of (added) parameters, e.g. LoRA, Adapter, Prefix-tuning, P-tuning and P-tuning v2.
    • Prompt Engineering: design better prompts without touching the model weights.
    • Retrieval Augmented Generation: retrieve relevant external knowledge and add it to the prompt.

    Roughly speaking: with enough compute and data, Full Fine-tuning tends to give the best results; with limited compute, Parameter-Efficient Fine-tuning or Prompt Engineering is the pragmatic choice; and when external knowledge is needed, Retrieval Augmented Generation fills retrieved content into the prompt template and lets the model answer on top of it (RAG).

    Since the largest model in this project has only 218M parameters (smaller even than bert-large at 340M), we simply use Full Fine-tuning.

  2. SFT data: adapting LLMs to vertical domains has been the dominant trend since 2023, so domain-specific SFT datasets are increasingly easy to find. This project uses two groups of SFT data: general chat data and medical-domain data.

    General SFT data (used for the NormalChat models):

    | Dataset | Description |
    |---|---|
    | alpaca-zh | Part of the SFT data collected by shibing624: self-instruct data generated with GPT-4 following the Alpaca recipe, about 50k samples. |
    | bell | SFT data from BelleGroup: about 1,000,000 Chinese instruction samples produced by the BELLE project. |

    Medical SFT data (used for the MedicalChat models):

    | Dataset | Description |
    |---|---|
    | shibing624/medical | The SFT portion of shibing624/medical. |
    | HuatuoGPT-sft-data-v1 | The SFT data used to train HuatuoGPT. |
    | DISC-Med-SFT | The DISC-Med-SFT dataset. |
    | ChatMed_Consult-v0.3 | michaelwzhu/ChatMed_Consult-v0.3 (ChatMed-Dataset): the queries (prompts) come from 549,326 real-world online medical consultations, and the responses were generated with OpenAI GPT-3.5. |

    SFT sample construction:

    Building SFT samples means turning the SFT data into Dataloader batches, which is slightly more involved than in pretraining. The core logic lives in dataset_sft.py and boils down to two rules (a minimal sketch follows the commands below):

    • The prompt and the answer must be separated by a <bos> token, and the answer must end with an <eos> token.
    • When computing the loss, the prompt part is masked out so that only the answer tokens contribute to the loss.
    # Download the alpaca-zh and bell SFT data and convert them into one unified format
    python sft_data_process.py
    # This writes sft_data.csv under ./sft_data
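
    A minimal sketch of the two rules above; the function name, padding and truncation details are illustrative and may differ from the repository's dataset_sft.py.

    ```python
    # Illustrative SFT sample construction: <bos> separates prompt from answer, <eos> ends
    # the answer, and the loss mask zeroes out every prompt position. Not the repository's code.
    import numpy as np

    def build_sft_sample(prompt_ids, answer_ids, bos_id, eos_id, max_seq_len=512, pad_id=0):
        ids = list(prompt_ids) + [bos_id] + list(answer_ids) + [eos_id]
        loss_mask = [0] * (len(prompt_ids) + 1) + [1] * (len(answer_ids) + 1)  # mask the prompt
        ids, loss_mask = ids[:max_seq_len], loss_mask[:max_seq_len]
        pad = max_seq_len - len(ids)
        ids += [pad_id] * pad
        loss_mask += [0] * pad                          # padding never contributes to the loss
        x = np.array(ids[:-1], dtype=np.int64)          # inputs
        y = np.array(ids[1:], dtype=np.int64)           # next-token targets
        mask = np.array(loss_mask[1:], dtype=np.int64)  # aligned with the shifted targets
        return x, y, mask
    ```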
    

    Full Fine-tuning

    # Full fine-tuning
    screen -S ambrose    # create a screen session named "ambrose"
    screen -r ambrose    # re-attach to the "ambrose" screen session
    # Run inside the screen session
    python sft.py
    # The SFT model is saved under out/sft
    


  1. Pretrained base models (the last column is the Baidu Netdisk extraction code):

    | Model | Pretraining tokens | Corpus | Config | Netdisk code |
    |---|---|---|---|---|
    | Llama2-Chinese-92M-v1 | 8.278B | Wiki+BaiduBaiKe+shibing624/medical | max_seq_len=512, dim=512, n_layers=8, n_heads=8 | da7h |
    | Llama2-Chinese-92M-v2 | 14B | Wiki+BaiduBaiKe+shibing624/medical+C4_zh | max_seq_len=512, dim=512, n_layers=8, n_heads=8 | bjal |
    | Llama2-Chinese-92M-v1-smallvocab (vocab size: 21131) | 8.278B | Wiki+BaiduBaiKe+shibing624/medical | max_seq_len=512, dim=512, n_layers=8, n_heads=8 | ttst |
    | Llama2-Chinese-218M-v1 | 8.278B | Wiki+BaiduBaiKe+shibing624/medical | max_seq_len=1024, dim=1024, n_layers=12, n_heads=8 | c10m |
    | Llama2-Chinese-218M-v2 | 14B | Wiki+BaiduBaiKe+shibing624/medical+C4_zh | max_seq_len=1024, dim=1024, n_layers=12, n_heads=8 | dkne |
    | Llama2-Chinese-218M-v3 | 63.4B | Wiki+BaiduBaiKe+shibing624/medical+C4_zh+WuDaoCorpora | max_seq_len=1024, dim=1024, n_layers=12, n_heads=8 | tpyy |

    Pretraining loss curves:

    v1 series (8.278B tokens: Wiki + BaiduBaiKe + shibing624/medical): Llama2-Chinese-92M-v1 vs Llama2-Chinese-92M-v1-smallvocab vs Llama2-Chinese-218M-v1, see loss_tokens-v1.png

    v2 series (14B tokens: Wiki + BaiduBaiKe + shibing624/medical + C4_zh): Llama2-Chinese-92M-v2 vs Llama2-Chinese-218M-v2, see loss_tokens.png

    v3 series (63.4B tokens: Wiki + BaiduBaiKe + shibing624/medical + C4_zh + WuDaoCorpora): Llama2-Chinese-218M-v3, see loss_tokens-v3.png


    # Inference with the pretrained (base) models: eval_pretrain.py
    python eval_pretrain.py
    
    Sample outputs: two Chinese test prompts were fed to each pretrained model (Llama2-Chinese-92M-v1, 92M-v2, 218M-v1, 218M-v2 and 218M-v3) and their raw continuations are compared.

  2. SFT chat models:

    | Model | SFT data | Config | Netdisk code |
    |---|---|---|---|
    | Llama2-Chinese-92M-v1-NormalChat | alpaca-zh+bell | max_seq_len=512, dim=512, n_layers=8, n_heads=8 | da7h |
    | Llama2-Chinese-92M-v1-MedicalChat | shibing624/medical+HuatuoGPT-sft-data-v1+DISC-Med-SFT+ChatMed_Consult-v0.3 | max_seq_len=512, dim=512, n_layers=8, n_heads=8 | da7h |
    | Llama2-Chinese-92M-v2-NormalChat | alpaca-zh+bell | max_seq_len=512, dim=512, n_layers=8, n_heads=8 | bjal |
    | Llama2-Chinese-92M-v2-MedicalChat | shibing624/medical+HuatuoGPT-sft-data-v1+DISC-Med-SFT+ChatMed_Consult-v0.3 | max_seq_len=512, dim=512, n_layers=8, n_heads=8 |  |
    | Llama2-Chinese-218M-v1-NormalChat | alpaca-zh+bell | max_seq_len=1024, dim=1024, n_layers=12, n_heads=8 |  |
    | Llama2-Chinese-218M-v1-MedicalChat | shibing624/medical+HuatuoGPT-sft-data-v1+DISC-Med-SFT+ChatMed_Consult-v0.3 | max_seq_len=1024, dim=1024, n_layers=12, n_heads=8 |  |
    | Llama2-Chinese-218M-v2-NormalChat | alpaca-zh+bell | max_seq_len=1024, dim=1024, n_layers=12, n_heads=8 | dkne |
    | Llama2-Chinese-218M-v2-MedicalChat | shibing624/medical+HuatuoGPT-sft-data-v1+DISC-Med-SFT+ChatMed_Consult-v0.3 | max_seq_len=1024, dim=1024, n_layers=12, n_heads=8 |  |
    | Llama2-Chinese-218M-v3-NormalChat | alpaca-zh+bell | max_seq_len=1024, dim=1024, n_layers=12, n_heads=8 |  |
    | Llama2-Chinese-218M-v3-MedicalChat | shibing624/medical+HuatuoGPT-sft-data-v1+DISC-Med-SFT+ChatMed_Consult-v0.3 | max_seq_len=1024, dim=1024, n_layers=12, n_heads=8 | tpyy |
    SFT model evaluation:
    # Evaluate the SFT models with eval.py
    python eval.py
    
    Sample outputs: three Chinese test questions were put to every NormalChat and MedicalChat model listed above and their answers are compared.

    The difference the medical SFT data makes is clearly visible when comparing the MedicalChat answers with the NormalChat answers on medical questions.

If you are interested in LLMs, feel free to join the QQ discussion group: 716455397.

Related projects:

• Llama2
• ChatLM-mini-Chinese (a 0.2B-parameter Chinese dialogue model)