MiniCPM3-4B: An edge-side LLM that surpasses GPT-3.5-Turbo.
APACHE-2.0 License
Note: More model versions can be found here.
MiniCPM 3.0 is a language model with 4 billion parameters. Compared to MiniCPM 1.0/2.0, it offers more comprehensive features and a significant improvement in overall capabilities. Its performance on most evaluation benchmarks rivals or even surpasses many models with 7B-9B parameters.
We evaluate the function calling capability of MiniCPM3 on Berkeley Function Calling Leaderboard (BFCL). MiniCPM3-4B outperforms several models with 7B-9B parameters on this leaderboard, surpassing GPT-3.5-Turbo-0125.
In the Needle in a Haystack test with a context length of 32k, the results are shown as follows:
We also propose a divide-and-conquer long-sequence processing framework LLMxMapReduce to support text with any length. MiniCPM3xMapReduce can achieve comparable performance with GPT-4 and KimiChat.
Context length | Qwen2-70b | Kimi-Chat(2024.06) | GPT-4 (From InfiniteBench) | MiniCPM 3.0 x MR | Qwen2-70b x MR | Llama3-70bx MR | |
---|---|---|---|---|---|---|---|
Math.Find | 87.9k | 59.71% | 18.57% | 60.00% | 83.43% | 54.29% | 91.43% |
Retrieve.KV | 89.9k | 29.00% | 69.20% | 89.00% | 93.80% | 98.80% | 98.89% |
En.Dia | 103.6K | 23.00% | 23.00% | 7.50% | 12.50% | 46.50% | 17.50% |
Code.Debug | 114.7k | 45.43% | 38.32% | 54.31% | 25.63% | 54.82% | 62.94% |
Retrieve.Number | 122.4k | 100.00% | 97.45% | 100.00% | 99.32% | 100.00% | 99.79% |
Retrieve.PassKey | 122.4k | 100.00% | 99.32% | 100.00% | 98.81% | 100.00% | 100.00% |
En.Sum | 171.5K | 31.85% | 29.94% | 14.73% | 25.89% | 32.39% | 30.63% |
En.MC | 184.4k | 81.66% | 79.91% | 68.12% | 66.38% | 83.84% | 82.10% |
En.QA | 192.6k | 21.97% | 18.80% | 22.44% | 28.39% | 23.13% | 34.70% |
Zh.QA | 2068.6k | 21.40% | 19.84% | 25.96% | 23.66% | 19.10% | N/A |
avg w/o Zh.QA | / | 51.92% | 52.96% | 55.33% | 59.29% | 64.98% | 68.64% |
avg | / | 48.86% | 49.65% | 52.39% | 55.55% | 60.39% | N/A |
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)
path = 'openbmb/MiniCPM3-4B'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)
responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
print(responds)
Refer to SGLang repo to install the latest version via source code.
python -m sglang.launch_server --model openbmb/MiniCPM3-4B --trust-remote-code --port 30000 --chat-template chatml
from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint
@function
def multi_turn_question(s, question_1, question_2):
s += user(question_1)
s += assistant(gen("answer_1", max_tokens=1024))
s += user(question_2)
s += assistant(gen("answer_2", max_tokens=1024))
set_default_backend(RuntimeEndpoint("http://localhost:30000"))
state = multi_turn_question.run(
question_1="Introduce artificial intelligence",
question_2="Write an article about it",
)
for m in state.messages():
print(m["role"], ":", m["content"])
pip install "vllm>=0.6.2"
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "openbmb/MiniCPM3-4B"
prompt = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
llm = LLM(model=model_name,
trust_remote_code=True,
tensor_parallel_size=1
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024)
outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
We have provided the GGUF formats of MiniCPM3, which can be used in llama.cpp.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
./llama-cli -c 1024 -m minicpm3-4b-fp16.gguf -n 1024 --top-p 0.7 --temp 0.7 --prompt "<|im_start|>user\nWrite an article about Artificial Intelligence.<|im_end|>\n<|im_start|>assistant\n"
We have supported fine-tuning MiniCPM3 using LLaMA-Factory. For usage instructions, refer to LLaMA-Factory Fine-tuning."
We use vLLM in the example code for the following advanced features.
We provide example code for using function calls with MiniCPM3:
cd demo/minicpm3/function_call
python function_call.py
If you want to start a function call service, use the following commands:
cd demo/minicpm3/function_call
pip install -r requirements.txt
python openai_api_server.py \
--model openbmb/MiniCPM3-4B \
--served-model-name MiniCPM3-4B \
--chat-template chatml.jinja \
--dtype auto \
--api-key token-abc123 \
--tensor-parallel-size 1 \
--trust-remote-code
Below is a demo of using a search engine to answer the question:
We provide example code for using the code interpreter with MiniCPM3:
cd demo/minicpm3/code_interpreter
pip install -r requirements.txt
python code_interpreter.py openbmb/MiniCPM3-4B
Below is an example of using the code interpreter to generate a QR code:
MiniCPM 2.0 series upgrade MiniCPM in multiple dimensions, including:
Model | avg | avg w/o code&math | passkey | number_string | kv_retrieval | longbook_choice_eng | longbook_qa_chn | longbook_qa_eng | longbook_sum_eng | longdialogue_qa_eng | math_calc | math_find | code_debug | code_run |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LWM-Text-128k | 24.45 | 33.62 | 100 | 97.8 | 0.6 | 28.82 | 15.93 | 14.31 | 9.99 | 1.5 | 0 | 3.43 | 20.05 | 1 |
Yarn-Mistral-7b-128k | 19.84 | 27.36 | 92.71 | 0 | 27.95 | 15.49 | 9.55 | 9.06 | 7.5 | 0 | 17.14 | 0.76 | 1.25 | |
Mistral-7B-Instruct-v0.2(ABF 1000w) | 27.75 | 36.9 | 100 | 78.98 | 3.6 | 37.12 | 11.74 | 17.37 | 21.12 | 9.5 | 0 | 29.43 | 17.51 | 0 |
Yi-6B-200k | 22.15 | 32.54 | 100 | 94.92 | 0 | 36.68 | 15.07 | 9.2 | 0.92 | 3.5 | 0 | 4.29 | 0.51 | 0.75 |
chatglm3-6b-128k | 25.58 | 36.57 | 89.93 | 99.66 | 5.2 | 46.29 | 10.7 | 8.38 | 25.91 | 6.5 | 0 | 8 | 5.33 | 1 |
MiniCPM-2.4B-128k | 27.32 | 37.68 | 98.31 | 99.83 | 9 | 29.69 | 23.06 | 16.33 | 15.73 | 9.5 | 0 | 4.29 | 22.08 | 0 |
Note* means evaluation results are directly taken from their technical reports. † means evaluation results on the full set of MBPP, instead of the hand-verified set.
Setting | AverageSparsity | AveragePerformance | CodeGeneration | CommonsenseReasoning | ReadingComprehension | GSM8K | MMLU | BBH | AGI-Eval |
---|---|---|---|---|---|---|---|---|---|
LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
ProSparse-7B* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
ProSparse-7B | 89.32 | 38.46 | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
ProSparse-13B* | 87.97 | 45.07 | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
ProSparse-13B | 88.80 | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
MiniCPM-S-1B* | 86.25 | 44.72 | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
MiniCPM-S-1B | 87.89 | 44.72 | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |
Note
Please refer to Inference section in MiniCPM1.0.
Currently, PowerInfer is exclusively tailored for the MiniCPM-S-1B model; support for other versions is not yet available, stay tuned.
# Download the installation package
sudo wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz
# Extract the installation package
sudo tar -zxvf cmake-3.23.0.tar.gz
# Configure the installation environment
sudo ./configure
sudo make -j8
# Compile and install
sudo make install
# Check the version after installation
cmake --version
# If the version number is returned, the installation was successful
# cmake version 3.23.0
git clone https://github.com/SJTU-IPADS/PowerInfer
cd PowerInfer
pip install -r requirements.txt # install Python helpers' dependencies
cmake -S . -B build
cmake --build build --config Release
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
git clone https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf/tree/main
#or
git clone https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft-gguf
cd PowerInfer
# Below is the command template. output_token_count refers to the maximum output tokens, thread_num is the number of threads, and prompt is the input prompt text.
#./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
# Below is an example
./build/bin/main -m /root/ld/ld_model_pretrain/1b-s-minicpm/MiniCPM-S-1B-sft.gguf -n 2048 -t 8 -p '<User>hello,tell me a story please.<AI>'
MiniCPM-2B is a dense language model with only 2.4B parameters excluding embeddings (2.7B in total).
MiniCPM has very close performance compared with Mistral-7B on open-sourced general benchmarks with better ability on Chinese, Mathematics and Coding after SFT. The overall performance exceeds Llama2-13B, MPT-30B, Falcon-40B, etc.
After DPO, MiniCPM outperforms Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, Zephyr-7B-alpha, etc. on MTBench.
Note: To ensure the generality of the model for academic research purposes, we have not subject it to any identity-specific training. Meanwhile, as we use ShareGPT open-source corpus as part of the training data, the model may output identity-related information similar to the GPT series models.
git clone https://github.com/OpenBMB/UltraEval.git
cd UltraEval
pip install -e .
wget -O RawData.zip "https://cloud.tsinghua.edu.cn/f/71b5232264ae4833a4d0/?dl=1"
unzip RawData.zip
python data_process.py
bash run_eval.sh
Model | Average Score | Average Score in English | Average Score in Chinese | C-Eval | CMMLU | MMLU | HumanEval | MBPP | GSM8K | MATH | BBH | ARC-E | ARC-C | HellaSwag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama2-7B | 35.40 | 36.21 | 31.765 | 32.42 | 31.11 | 44.32 | 12.2 | 27.17 | 13.57 | 1.8 | 33.23 | 75.25 | 42.75 | 75.62* |
Qwen-7B | 49.46 | 47.19 | 59.655 | 58.96 | 60.35 | 57.65 | 17.07 | 42.15 | 41.24 | 5.34 | 37.75 | 83.42 | 64.76 | 75.32* |
Deepseek-7B | 39.96 | 39.15 | 43.635 | 42.82 | 44.45 | 47.82 | 20.12 | 41.45 | 15.85 | 1.53 | 33.38 | 74.58* | 42.15* | 75.45* |
Mistral-7B | 48.97 | 49.96 | 44.54 | 46.12 | 42.96 | 62.69 | 27.44 | 45.2 | 33.13 | 5.0 | 41.06 | 83.92 | 70.73 | 80.43* |
Llama2-13B | 41.48 | 42.44 | 37.19 | 37.32 | 37.06 | 54.71 | 17.07 | 32.55 | 21.15 | 2.25 | 37.92 | 78.87* | 58.19 | 79.23* |
MPT-30B | 38.17 | 39.82 | 30.715 | 29.34 | 32.09 | 46.56 | 21.95 | 35.36 | 10.31 | 1.56 | 38.22 | 78.66* | 46.08* | 79.72* |
Falcon-40B | 43.62 | 44.21 | 40.93 | 40.29 | 41.57 | 53.53 | 24.39 | 36.53 | 22.44 | 1.92 | 36.24 | 81.94* | 57.68 | 83.26* |
MiniCPM-2B | 52.33 | 52.6 | 51.1 | 51.13 | 51.07 | 53.46 | 50.00 | 47.31 | 53.83 | 10.24 | 36.87 | 85.44 | 68.00 | 68.25 |
Model | Average Score | Average Score in English | Average Score in Chinese | C-Eval | CMMLU | MMLU | HumanEval | MBPP | GSM8K | MATH | BBH | ARC-E | ARC-C | HellaSwag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TinyLlama-1.1B | 25.36 | 25.55 | 24.525 | 25.02 | 24.03 | 24.3 | 6.71 | 19.91 | 2.27 | 0.74 | 28.78 | 60.77* | 28.15* | 58.33* |
Qwen-1.8B | 34.72 | 31.87 | 47.565 | 49.81 | 45.32 | 43.37 | 7.93 | 17.8 | 19.26 | 2.42 | 29.07 | 63.97* | 43.69 | 59.28* |
Gemini Nano-3B | - | - | - | - | - | - | - | 27.2(report) | 22.8(report) | - | 42.4(report) | - | - | - |
StableLM-Zephyr-3B | 43.46 | 46.31 | 30.615 | 30.34 | 30.89 | 45.9 | 35.37 | 31.85 | 52.54 | 12.49 | 37.68 | 73.78 | 55.38 | 71.87* |
Phi-2-2B | 48.84 | 54.41 | 23.775 | 23.37 | 24.18 | 52.66 | 47.56 | 55.04 | 57.16 | 3.5 | 43.39 | 86.11 | 71.25 | 73.07* |
MiniCPM-2B | 52.33 | 52.6 | 51.1 | 51.13 | 51.07 | 53.46 | 50.00 | 47.31 | 53.83 | 10.24 | 36.87 | 85.44 | 68.00 | 68.25 |
Model | Average Score | Average Score in English | Average Score in Chinese | C-Eval | CMMLU | MMLU | HumanEval | MBPP | GSM8K | MATH | BBH | ARC-E | ARC-C | HellaSwag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ChatGLM2-6B | 37.98 | 35.17 | 50.63 | 52.05 | 49.21 | 45.77 | 10.37 | 9.38 | 22.74 | 5.96 | 32.6 | 74.45 | 56.82 | 58.48* |
Mistral-7B-Instruct-v0.1 | 44.36 | 45.89 | 37.51 | 38.06 | 36.96 | 53.56 | 29.27 | 39.34 | 28.73 | 3.48 | 39.52 | 81.61 | 63.99 | 73.47* |
Mistral-7B-Instruct-v0.2 | 50.91 | 52.83 | 42.235 | 42.55 | 41.92 | 60.51 | 36.59 | 48.95 | 40.49 | 4.95 | 39.81 | 86.28 | 73.38 | 84.55* |
Qwen-7B-Chat | 44.93 | 42.05 | 57.9 | 58.57 | 57.23 | 56.03 | 15.85 | 40.52 | 42.23 | 8.3 | 37.34 | 64.44* | 39.25* | 74.52* |
Yi-6B-Chat | 50.46 | 45.89 | 70.995 | 70.88 | 71.11 | 62.95 | 14.02 | 28.34 | 36.54 | 3.88 | 37.43 | 84.89 | 70.39 | 74.6* |
Baichuan2-7B-Chat | 44.68 | 42.74 | 53.39 | 53.28 | 53.5 | 53 | 21.34 | 32.32 | 25.25 | 6.32 | 37.46 | 79.63 | 60.15 | 69.23* |
Deepseek-7B-chat | 49.34 | 49.56 | 48.335 | 46.95 | 49.72 | 51.67 | 40.85 | 48.48 | 48.52 | 4.26 | 35.7 | 76.85 | 63.05 | 76.68* |
Llama2-7B-Chat | 38.16 | 39.17 | 33.59 | 34.54 | 32.64 | 47.64 | 14.02 | 27.4 | 21.15 | 2.08 | 35.54 | 74.28 | 54.78 | 75.65* |
MiniCPM-2B | 52.33 | 52.6 | 51.1 | 51.13 | 51.07 | 53.46 | 50.00 | 47.31 | 53.83 | 10.24 | 36.87 | 85.44 | 68.00 | 68.25 |
Model | MT-bench |
---|---|
GPT-4-turbo | 9.32 |
GPT-3.5-turbo | 8.39 |
Mistral-8*7b-Instruct-v0.1 | 8.30 |
Claude-2.1 | 8.18 |
Zephyr-7B-beta | 7.34 |
MiniCPM-2B | 7.25 |
Vicuna-33B | 7.12 |
Zephyr-7B-alpha | 6.88 |
LLaMA-2-70B-chat | 6.86 |
Mistral-7B-Instruct-v0.1 | 6.84 |
MPT-34B-instruct | 6.39 |
Using the following command can launch the gradio-based demo.
# generation powered by vllm
python demo/minicpm/vllm_based_demo.py --model_path <vllmcpm_repo_path>
# generation powered by huggingface
python demo/minicpm/hf_based_demo.py --model_path <hf_repo_path>
Install transformers>=4.36.0
and accelerate
run the following python code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)
path = 'openbmb/MiniCPM-2B-dpo-bf16'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)
responds, history = model.chat(tokenizer, "Which city is the capital of China?", temperature=0.8, top_p=0.8)
print(responds)
To facilitate ease of use, we have converted the model weights of MiniCPM to adapt to the structure of the LLaMA model:
import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM
model_path = "openbmb/MiniCPM-2B-dpo-bf16-llama-format"
tokenizer = LlamaTokenizerFast.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)
prompt="Now you act like a terminal situated within a beginner's C++ practice repository folder, please provide the output for the command: `ls -l`"
input_ids = tokenizer.encode("<User>{}<AI>".format(prompt), return_tensors='pt', add_special_tokens=True).cuda()
responses = model.generate(input_ids, temperature=0.3, top_p=0.8, repetition_penalty=1.02, max_length=1024)
responses = tokenizer.decode(responses[0], skip_special_tokens=True)
print(responses)
Install vLLM.
pip install "vllm>=0.4.1"
See here for the inference code.
Install SGLang.
python -m sglang.launch_server --model-path openbmb/MiniCPM-2B-dpo-fp16 --trust-remote-code --port 30000
from sglang import function, gen, set_default_backend, RuntimeEndpoint
@function
def text_qa(s, question):
s += "<User>" + question + "<AI>"
s += gen("answer", max_tokens=1024, temperature=0.7, top_p=0.7)
set_default_backend(RuntimeEndpoint("http://localhost:30000"))
state = text_qa.run(
question="What is the capital of China?",
)
print(state["answer"])
We have supported inference with llama.cpp, ollama, fastllm, mlx_lm. Thanks to @runfuture for the adaptation of llama.cpp and ollama.
Please refer to Edge Deployment Tutorial.
Please refer to Quantization Tutorial.
This project is developed by the following institutions:
@misc{hu2024minicpm,
title={MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies},
author={Shengding Hu and Yuge Tu and Xu Han and Chaoqun He and Ganqu Cui and Xiang Long and Zhi Zheng and Yewei Fang and Yuxiang Huang and Weilin Zhao and Xinrong Zhang and Zheng Leng Thai and Kaihuo Zhang and Chongyi Wang and Yuan Yao and Chenyang Zhao and Jie Zhou and Jie Cai and Zhongwu Zhai and Ning Ding and Chao Jia and Guoyang Zeng and Dahai Li and Zhiyuan Liu and Maosong Sun},
year={2024},
eprint={2404.06395},
archivePrefix={arXiv},
primaryClass={cs.CL}
}