Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache-2.0 License
> [!IMPORTANT]
> `bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
**IPEX-LLM** is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency[^1].
> [!NOTE]
> - It is built on top of the excellent work of `llama.cpp`, `transformers`, `bitsandbytes`, `vLLM`, `qlora`, `AutoGPTQ`, `AutoAWQ`, etc.
> - It provides seamless integration with llama.cpp, Ollama, Text-Generation-WebUI, HuggingFace transformers, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc.
> - 50+ models have been optimized/verified on `ipex-llm` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
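As a quick taste of what this looks like in practice, here is a minimal sketch of the `transformers`-style API (the model path and prompt are placeholders; the `xpu` device assumes an Intel GPU with the required drivers, and running on CPU simply means skipping the `.to("xpu")` calls):

```python
# Minimal sketch of ipex-llm usage; model path and prompt are placeholders.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for transformers

model_path = "meta-llama/Llama-2-7b-chat-hf"  # any verified model

# Load with low-bit (INT4) optimizations applied on the fly, then move to the Intel GPU.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
model = model.to("xpu")  # skip this line to run on CPU

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```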
## Latest Update

- `ipex-llm` now supports Axolotl for LLM finetuning on Intel GPU; see the quickstart here.
- You can now run `ipex-llm` inference, serving and finetuning using the Docker images.
- You can now install `ipex-llm` on Windows using just "one command".
- You can now run Open WebUI on Intel GPU with `ipex-llm`; see the quickstart here.
- You can now run Llama 3 on Intel GPU using `llama.cpp` and `ollama` with `ipex-llm`; see the quickstart here.
- `ipex-llm` now supports Llama 3 on both Intel GPU and CPU.
- `ipex-llm` now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
- `bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
- `ipex-llm` now supports directly loading models from ModelScope (魔搭).
- `ipex-llm` added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large-sized LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
- You can now use `ipex-llm` through the Text-Generation-WebUI GUI.
- `ipex-llm` now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
- `ipex-llm` now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- Using `ipex-llm` QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
- `ipex-llm` now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
- `ipex-llm` now supports Mixtral-8x7B on both Intel GPU and CPU.
- `ipex-llm` now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
- `ipex-llm` now supports FP8 and FP4 inference on Intel GPU.
- Initial support for directly loading GGUF, AWQ and GPTQ models into `ipex-llm` is available.
- `ipex-llm` now supports vLLM continuous batching on both Intel GPU and CPU.
- `ipex-llm` now supports QLoRA finetuning on both Intel GPU and CPU.
- `ipex-llm` now supports FastChat serving on both Intel CPU and GPU.
- `ipex-llm` now supports Intel GPU (including iGPU, Arc, Flex and MAX).
- The `ipex-llm` tutorial is released.
## `ipex-llm` Performance

See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below[^1] (and refer to [2][3][4] for more details).

You may follow the Benchmarking Guide to run the `ipex-llm` performance benchmark yourself.
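If you only need a rough, informal number before running the full benchmark, a simple timing loop such as the sketch below can be used (this is not the official benchmark script; the model and prompt are placeholders, and `torch.xpu.synchronize()` assumes an XPU-enabled PyTorch/IPEX build):

```python
# Informal tokens-per-second estimate; not the official ipex-llm benchmark script.
import time
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer("Tell me about Intel GPUs.", return_tensors="pt").input_ids.to("xpu")

def timed_generate(n_tokens):
    if hasattr(torch, "xpu"):
        torch.xpu.synchronize()          # make sure timing covers all queued GPU work
    start = time.perf_counter()
    out = model.generate(input_ids, max_new_tokens=n_tokens)
    if hasattr(torch, "xpu"):
        torch.xpu.synchronize()
    return out, time.perf_counter() - start

with torch.inference_mode():
    timed_generate(32)                   # warm-up run, not counted
    output, elapsed = timed_generate(128)

generated = output.shape[1] - input_ids.shape[1]
print(f"{generated} tokens in {elapsed:.2f}s ≈ {generated / elapsed:.1f} tokens/s")
```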
## `ipex-llm` Demo

See demos of running local LLMs on Intel Iris iGPU, Intel Core Ultra iGPU, single-card Arc GPU, or multi-card Arc GPUs using `ipex-llm` below.
Please see the perplexity results below (tested on the Wikitext dataset using the script here).
Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
---|---|---|---|---|---|---|
Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
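As a reminder of what these numbers mean, perplexity is the exponential of the average per-token negative log-likelihood. The sketch below is an illustrative chunked evaluation (not the exact script referenced above); the model, the `load_in_low_bit` precision and the corpus file are placeholders:

```python
# Illustrative chunk-based perplexity computation; not the exact script referenced above.
import math
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"              # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_low_bit="sym_int4").to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path)

text = open("wikitext-2-test.txt").read()                 # placeholder corpus file
token_ids = tokenizer(text, return_tensors="pt").input_ids
chunk_len, total_nll, total_tokens = 2048, 0.0, 0

with torch.inference_mode():
    for begin in range(0, token_ids.size(1), chunk_len):
        chunk = token_ids[:, begin:begin + chunk_len].to("xpu")
        if chunk.size(1) < 2:
            break
        # With labels == inputs, the model returns the mean NLL of next-token prediction.
        loss = model(chunk, labels=chunk).loss
        total_nll += loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1

print(f"perplexity = {math.exp(total_nll / total_tokens):.3f}")
```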
[^1]: Performance varies by use, configuration and other factors. `ipex-llm` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.
## `ipex-llm` Quickstart

### Docker
- Run `llama.cpp`, `ollama`, `Open WebUI`, etc., with `ipex-llm` on Intel GPU
- Run `transformers`, `LangChain`, `LlamaIndex`, `ModelScope`, etc. with `ipex-llm` on Intel GPU
- Run `vLLM` serving with `ipex-llm` on Intel GPU
- Run `vLLM` serving with `ipex-llm` on Intel CPU
- Run `FastChat` serving with `ipex-llm` on Intel GPU
- Develop `ipex-llm` applications in Python using VSCode on Intel GPU

### Use
- llama.cpp: running llama.cpp (using the C++ interface of `ipex-llm` as an accelerated backend for `llama.cpp`) on Intel GPU
- ollama: running ollama (using the C++ interface of `ipex-llm` as an accelerated backend for `ollama`) on Intel GPU
- Llama 3 with `llama.cpp` and `ollama`: running Llama 3 on Intel GPU using `llama.cpp` and `ollama` with `ipex-llm`
- vLLM: running `ipex-llm` in vLLM on both Intel GPU and CPU
- FastChat: running `ipex-llm` in FastChat serving on both Intel GPU and CPU
- Serving on multiple GPUs: running `ipex-llm` serving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI
- Text-Generation-WebUI: running `ipex-llm` in `oobabooga` WebUI
- Axolotl: using `ipex-llm` in Axolotl for LLM finetuning
- Benchmarking: running benchmarks for `ipex-llm` on Intel CPU and GPU

### Applications
- GraphRAG: running `GraphRAG` using local LLM with `ipex-llm`
- RAGFlow: running `RAGFlow` (an open-source RAG engine) with `ipex-llm`
- LangChain-Chatchat: running `LangChain-Chatchat` (Knowledge Base QA using RAG pipeline) with `ipex-llm`
- Continue: using `Continue` (coding copilot in VSCode) with `ipex-llm`
- Open WebUI: running `Open WebUI` with `ipex-llm`
- PrivateGPT: using `PrivateGPT` to interact with documents with `ipex-llm`
- Dify: running `ipex-llm` in `Dify` (production-ready LLM app development platform)

### Install
- Install `ipex-llm` on Windows with Intel GPU
- Install `ipex-llm` on Linux with Intel GPU

### Code Examples
- Low-bit models: running `ipex-llm` low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.); a save/load sketch is shown after this list
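As a rough illustration of the low-bit model workflow listed above, the sketch below converts a model once and reloads the converted checkpoint later (it assumes the `save_low_bit`/`load_low_bit` helpers of `ipex_llm.transformers`; paths and the chosen precision are placeholders):

```python
# Sketch: convert a model to a low-bit format once, then reload the converted copy later.
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model
saved_path = "./llama2-7b-sym-int4"            # placeholder output directory

# First run: quantize on the fly and save the low-bit weights
# (other load_in_low_bit values include e.g. "fp4", "fp6", "fp8", "fp16").
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit="sym_int4")
model.save_low_bit(saved_path)

# Later runs: load the already-converted low-bit checkpoint directly.
model = AutoModelForCausalLM.load_low_bit(saved_path)
model = model.to("xpu")                        # or keep the model on CPU
```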
## Verified Models

Over 50 models have been optimized/verified on `ipex-llm`, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
Model | CPU Example | GPU Example |
---|---|---|
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) | link1, link2 | link |
LLaMA 2 | link1, link2 | link |
LLaMA 3 | link | link |
LLaMA 3.1 | link | link |
LLaMA 3.2 | link | |
ChatGLM | link | |
ChatGLM2 | link | link |
ChatGLM3 | link | link |
GLM-4 | link | link |
GLM-4V | link | link |
Mistral | link | link |
Mixtral | link | link |
Falcon | link | link |
MPT | link | link |
Dolly-v1 | link | link |
Dolly-v2 | link | link |
Replit Code | link | link |
RedPajama | link1, link2 | |
Phoenix | link1, link2 | |
StarCoder | link1, link2 | link |
Baichuan | link | link |
Baichuan2 | link | link |
InternLM | link | link |
InternVL2 | link | |
Qwen | link | link |
Qwen1.5 | link | link |
Qwen2 | link | link |
Qwen2.5 | link | |
Qwen-VL | link | link |
Qwen2-VL | link | |
Qwen2-Audio | link | |
Aquila | link | link |
Aquila2 | link | link |
MOSS | link | |
Whisper | link | link |
Phi-1_5 | link | link |
Flan-t5 | link | link |
LLaVA | link | link |
CodeLlama | link | link |
Skywork | link | |
InternLM-XComposer | link | |
WizardCoder-Python | link | |
CodeShell | link | |
Fuyu | link | |
Distil-Whisper | link | link |
Yi | link | link |
BlueLM | link | link |
Mamba | link | link |
SOLAR | link | link |
Phixtral | link | link |
InternLM2 | link | link |
RWKV4 | link | |
RWKV5 | link | |
Bark | link | link |
SpeechT5 | link | |
DeepSeek-MoE | link | |
Ziya-Coding-34B-v1.0 | link | |
Phi-2 | link | link |
Phi-3 | link | link |
Phi-3-vision | link | link |
Yuan2 | link | link |
Gemma | link | link |
Gemma2 | link | |
DeciLM-7B | link | link |
Deepseek | link | link |
StableLM | link | link |
CodeGemma | link | link |
Command-R/cohere | link | link |
CodeGeeX2 | link | link |
MiniCPM | link | link |
MiniCPM3 | link | |
MiniCPM-V | link | |
MiniCPM-V-2 | link | link |
MiniCPM-Llama3-V-2_5 | link | |
MiniCPM-V-2_6 | link | link |