# LLM Inference benchmark
License: MIT
Framework comparison:

Framework | Reproducibility**** | Docker Image | API Server | OpenAI API Server | WebUI | Multi Models** | Multi-node | Backends | Embedding Model |
---|---|---|---|---|---|---|---|---|---|
text-generation-webui | Low | Yes | Yes | Yes | Yes | No | No | Transformers/llama.cpp/ExLlama/ExLlamaV2/AutoGPTQ/AutoAWQ/GPTQ-for-LLaMa/CTransformers | No |
OpenLLM | High | Yes | Yes | Yes | No | With BentoML | With BentoML | Transformers(int8,int4,gptq), vLLM(awq/squeezellm), TensorRT | No |
vLLM* | High | Yes | Yes | Yes | No | No | Yes(With Ray) | vLLM | No |
Xinference | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/vLLM/TensorRT/GGML | Yes |
TGI*** | Medium | Yes | Yes | No | No | No | No | Transformers/AutoGPTQ/AWQ/EETQ/vLLM/ExLlama/ExLlamaV2 | No |
ScaleLLM | Medium | Yes | Yes | Yes | Yes | No | No | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | No |
FastChat | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | Yes |
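
Most of the frameworks above expose an OpenAI-compatible API server, so the same client code can be pointed at whichever one is running. Below is a minimal sketch using the `openai` Python package; the base URL, API key, and served model name are assumptions and depend on how the server was launched.

```python
# Minimal sketch: query an OpenAI-compatible server exposed by one of the
# frameworks above (vLLM, OpenLLM, Xinference, FastChat, ...).
# base_url, api_key, and model are assumptions -- use whatever your server reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address
    api_key="EMPTY",                      # most local servers ignore the key
)

response = client.chat.completions.create(
    model="Yi-6B-Chat",                   # assumed served model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

Streaming works the same way with `stream=True`, which is how the first-token-latency numbers further down are defined.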
Backend comparison:

Backend | Device | Compatibility** | PEFT Adapters* | Quantization | Batching | Distributed Inference | Streaming |
---|---|---|---|---|---|---|---|
Transformers | GPU | High | Yes | bitsandbytes(int8/int4), AutoGPTQ(gptq), AutoAWQ(awq) | Yes | accelerate | Yes |
vLLM | GPU | High | No | awq/squeezellm | Yes | Yes | Yes |
ExLlamaV2 | GPU/CPU | Low | No | GPTQ | Yes | Yes | Yes |
TensorRT | GPU | Medium | No | some models | Yes | Yes | Yes |
Candle | GPU/CPU | Low | No | No | Yes | Yes | Yes |
CTranslate2 | GPU | Low | No | Yes | Yes | Yes | Yes |
TGI | GPU | Medium | Yes | awq/eetq/gptq/bitsandbytes | Yes | Yes | Yes |
llama-cpp*** | GPU/CPU | High | No | GGUF/GPTQ | Yes | No | Yes |
lmdeploy | GPU | Medium | No | AWQ | Yes | Yes | Yes |
Deepspeed-FastGen | GPU | Low | No | No | Yes | Yes | Yes |
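
The Transformers backend supports the broadest mix of quantization options in the table above. As a hedged illustration (not taken from this benchmark's scripts), loading a model in 8-bit with bitsandbytes looks roughly like this; the model id is an assumption.

```python
# Sketch: load a causal LM with bitsandbytes int8 through the Transformers
# backend. Requires transformers, accelerate, and bitsandbytes; the model id
# below is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "01-ai/Yi-6B-Chat"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # let accelerate place the weights
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The other quantization routes in the table (AutoGPTQ, AutoAWQ) load pre-quantized checkpoints rather than quantizing at load time.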

Benchmark setup:

- Hardware:
- Software:
- Model: Yi-6B-Chat
- Data:

No quantization:

Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
---|---|---|---|---|---|
text-generation-webui Transformer | 40.39 | 0.15 | 41.47 | 0.21 | 344.61 |
text-generation-webui Transformer with flash-attention-2 | 58.30 | 0.21 | 43.52 | 0.21 | 341.39 |
text-generation-webui ExllamaV2 | 69.09 | 0.26 | 50.71 | 0.27 | 564.80 |
OpenLLM PyTorch | 60.79 | 0.22 | 44.73 | 0.21 | 514.55 |
TGI | 192.58 | 0.90 | 59.68 | 0.28 | 82.72 |
vLLM | 222.63 | 1.08 | 62.69 | 0.30 | 95.43 |
TensorRT | - | - | - | - | - |
CTranslate2* | - | - | - | - | - |
lmdeploy | 236.03 | 1.15 | 67.86 | 0.33 | 76.81 |

- bs: batch size. `@4` in the column headers means bs=4 and `@1` means bs=1 (e.g. TPS@4 is the throughput at batch size 4).
- TPS: Tokens Per Second.
- QPS: Queries Per Second.
- FTL: First Token Latency, measured in milliseconds. Applicable only in stream mode (see the measurement sketch below).

`*` Encountered an error when using CTranslate2 to convert Yi-6B-Chat; see details in the issue.
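
For context on how these metrics can be obtained, the sketch below measures FTL and single-request TPS with one streaming call against an OpenAI-compatible endpoint. It only illustrates the definitions above and is not the script that produced the numbers; the endpoint, model name, and the one-token-per-chunk approximation are assumptions.

```python
# Sketch: measure FTL (ms until the first streamed token) and TPS for a single
# streaming request. Not the benchmark script itself; endpoint, model name, and
# the one-chunk-per-token approximation are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
generated_tokens = 0

stream = client.chat.completions.create(
    model="Yi-6B-Chat",  # assumed served model name
    messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()
    generated_tokens += 1  # roughly one token per streamed chunk

elapsed = time.perf_counter() - start
ftl_ms = (first_token_at - start) * 1000 if first_token_at else float("nan")
print(f"FTL: {ftl_ms:.1f} ms  TPS: {generated_tokens / elapsed:.1f}  QPS: {1 / elapsed:.2f}")
```

Measuring QPS at bs=4 would additionally require firing concurrent requests, which this single-request sketch does not do.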

8-bit quantization:

Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
---|---|---|---|---|---|
TGI eetq 8bit | 293.08 | 1.41 | 88.08 | 0.42 | 63.69 |
TGI GPTQ 8bit | - | - | - | - | - |
OpenLLM PyTorch AutoGPTQ 8bit | 49.8 | 0.17 | 29.54 | 0.14 | 930.16 |

4-bit quantization:

Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
---|---|---|---|---|---|
TGI AWQ 4bit | 336.47 | 1.61 | 102.00 | 0.48 | 94.84 |
vLLM AWQ 4bit | 29.03 | 0.14 | 37.48 | 0.19 | 3711.0 |
text-generation-webui llama-cpp GGUF 4bit | 67.63 | 0.37 | 56.65 | 0.34 | 331.57 |
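
As a hedged illustration of how a 4-bit run such as the vLLM AWQ row can be set up, vLLM's offline Python API accepts a `quantization` argument when the checkpoint already contains AWQ weights; the model id below is an assumption.

```python
# Sketch: offline generation with vLLM on an AWQ-quantized checkpoint, roughly
# matching the "vLLM AWQ 4bit" row. The model id is an assumption and must
# point to a checkpoint that already contains AWQ weights.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Yi-6B-Chat-AWQ", quantization="awq")  # assumed AWQ checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what first token latency means."], params)
print(outputs[0].outputs[0].text)
```

The TGI rows rely on the server's own quantization option instead (awq/eetq/gptq/bitsandbytes, as listed in the backend table) rather than a Python API.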