README FOR ENGLISH

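Overview

Background

This work is an entry in the NVIDIA TensorRT Hackathon 2023. The project uses TRT-LLM to accelerate inference for Qwen-7B-Chat. The corresponding code is on the release/0.1.0 branch; see that branch for the complete workflow.

As of April 24, 2024, the latest main branch of the official TensorRT-LLM repository supports qwen/qwen2, so this repository will no longer receive major updates.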

Features

FP16 / BF16 (experimental)
INT8 Weight-Only & INT8 Smooth Quant & INT4 Weight-Only & INT4-AWQ & INT4-GPTQ
INT8 KV CACHE
Tensor Parallel (multi-GPU)
Web demo built with gradio
Triton deployment API, combined with inflight_batching for maximum throughput/concurrency.
FastAPI-based API compatible with OpenAI requests, including function call support.
CLI chat from the command line.
LangChain integration.

Supported models: qwen2 (recommended) / qwen (maintained only up to 0.7.0) / qwen-vl (maintained only up to 0.7.0)

Base models (experimental): Qwen1.5-0.5B, Qwen1.5-1.8B, Qwen1.5-4B, Qwen1.5-7B, Qwen1.5-14B, Qwen1.5-32B, Qwen1.5-72B, QWen-VL, CodeQwen1.5-7B
Chat models (recommended): Qwen1.5-0.5B-Chat, Qwen1.5-1.8B-Chat, Qwen1.5-4B-Chat, Qwen1.5-7B-Chat, Qwen1.5-14B-Chat, Qwen1.5-32B-Chat, Qwen1.5-72B-Chat (experimental), QWen-VL-Chat, CodeQwen1.5-7B-Chat
Chat GPTQ-Int4 models: Qwen1.5-0.5B-Chat-GPTQ-Int4, Qwen1.5-1.8B-Chat-GPTQ-Int4, Qwen1.5-4B-Chat-GPTQ-Int4, Qwen1.5-7B-Chat-GPTQ-Int4, Qwen1.5-14B-Chat-GPTQ-Int4, Qwen1.5-32B-Chat-GPTQ-Int4, Qwen1.5-72B-Chat-GPTQ-Int4 (experimental), Qwen-VL-Chat-Int4

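Software and Hardware Requirements

Linux is recommended, with docker and nvidia-docker installed (installation guide). Windows should work in theory but has not been tested; feel free to experiment.
For Windows, see this tutorial: link
An NVIDIA GPU (30 series, 40 series, V100/A100, etc.) plus sufficient GPU memory, RAM, and disk space. The requirements below are estimated from Qwen's official inference requirements; see the table (peak requirements at build time only), for reference only: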

Quick Start

Preparation

Download the image.
- The official Triton image 24.02 corresponds to TensorRT-LLM 0.8.0 and does not include the TensorRT-LLM development package.
```
docker pull nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
```
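- Windows users who want to try a tritonserver deployment, or users without a GPU, can use the AutoDL image, which includes tritonserver 24.02 (corresponding to tensorrt_llm 0.8.0): link. Note: the link includes a complete build tutorial.

Clone this project's code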

git clone https://github.com/Tlntin/Qwen-TensorRT-LLM.git
cd Qwen-TensorRT-LLM

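Enter the project directory, then create and start the container. The command maps the local examples directory to /app/tensorrt_llm/examples and publishes ports 8000 and 7860, which makes it easy to debug the API and the web UI.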

docker run --gpus all \
  --name trt_llm \
  -d \
  --ipc=host \
  --ulimit memlock=-1 \
  --restart=always \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -p 7860:7860 \
  -v ${PWD}/examples:/app/tensorrt_llm/examples \
  nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 sleep 8640000
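
For scripted setups, the container start command above can be assembled programmatically. The following is a minimal sketch and not part of this repository: the helper name `docker_run_args` and its parameters are hypothetical, while the flags, ports, image tag, and mount target are copied from the command above.

```python
import os
import shlex


def docker_run_args(examples_dir, name="trt_llm", ports=(8000, 7860)):
    """Build the argument list for the `docker run` command shown above.

    `docker_run_args` is a hypothetical helper; only the flags themselves
    come from this README.
    """
    args = [
        "docker", "run", "--gpus", "all",
        "--name", name,
        "-d",
        "--ipc=host",
        "--ulimit", "memlock=-1",
        "--restart=always",
        "--ulimit", "stack=67108864",
    ]
    for port in ports:
        # Publish each port under the same number on the host.
        args += ["-p", f"{port}:{port}"]
    args += [
        "-v", f"{os.path.abspath(examples_dir)}:/app/tensorrt_llm/examples",
        "nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3",
        "sleep", "8640000",
    ]
    return args


if __name__ == "__main__":
    # Print the command for inspection instead of executing it.
    print(shlex.join(docker_run_args("examples")))
```

To actually start the container, the returned list could be passed to `subprocess.run(args, check=True)`.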

Enter the qwen2 directory inside the docker container.
- Install the officially prebuilt tensorrt_llm with pip. Install numpy 1.x first, because numpy 2.x is not compatible.
```
pip install "numpy<2"
pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
```
- Install the provided Python dependencies.
```
cd /app/tensorrt_llm/examples/qwen2/
pip install -r requirements.txt
```
- Upgrade transformers. qwen2 requires version 4.37 or later; warnings about mismatched dependencies can be ignored.
```
pip install "transformers>=4.37"
```
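
The two version constraints from the steps above (numpy below 2, transformers at least 4.37) can be checked after installation. This is a small sketch using only the standard library; the function names are hypothetical, and the version parsing is deliberately simple rather than a full implementation of Python's version rules.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.26.4' into (1, 26, 4), ignoring suffixes such as 'rc1' or '+cu121'."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def check_requirements(numpy_version, transformers_version):
    """Return a list of problems; an empty list means both constraints hold."""
    problems = []
    if parse_version(numpy_version) >= (2,):
        problems.append("numpy must be 1.x (numpy<2)")
    if parse_version(transformers_version) < (4, 37):
        problems.append("transformers must be >= 4.37")
    return problems


def check_installed():
    """Check the versions installed in the current environment."""
    try:
        return check_requirements(version("numpy"), version("transformers"))
    except PackageNotFoundError as exc:
        return [f"package not installed: {exc}"]
```

Calling `check_installed()` inside the container returns an empty list when both constraints are satisfied.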
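Download the model from HuggingFace (other platforms are not supported yet), for example QWen1.5-7B-Chat. Rename the folder to qwen1.5_7b_chat and place it under examples/qwen2/.
Modify the build parameters (optional)
- The default build parameters, including batch_size, max_input_len, max_new_tokens, and seq_length, are stored in default_config.py.
- The default model paths, including hf_model_dir (model path), tokenizer_dir (tokenizer path), and int4_gptq_model_dir (output path for manual GPTQ quantization), can be changed to your own paths.
- With 24 GB of GPU memory, build directly; the defaults are the fp16 data type and max_batch_size=2.
- With less GPU memory, reduce max_batch_size to 1, or further reduce max_input_len and max_new_tokens.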
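
To make the parameters above concrete, here is a hedged sketch of how such a configuration could be modelled. It is not the actual content of default_config.py: the `BuildConfig` class and the `for_low_memory` helper are hypothetical, and only the fp16 default and max_batch_size=2 come from this README. The length values are placeholders.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BuildConfig:
    # Field names follow the parameters listed above; the real layout of
    # default_config.py may differ.
    dtype: str = "fp16"
    max_batch_size: int = 2
    max_input_len: int = 2048    # placeholder value (assumption)
    max_new_tokens: int = 2048   # placeholder value (assumption)
    hf_model_dir: str = "qwen1.5_7b_chat"
    tokenizer_dir: str = "qwen1.5_7b_chat"


def for_low_memory(cfg, halve_lengths=False):
    """Return a copy with batch size 1, optionally halving both length limits."""
    cfg = replace(cfg, max_batch_size=1)
    if halve_lengths:
        cfg = replace(
            cfg,
            max_input_len=cfg.max_input_len // 2,
            max_new_tokens=cfg.max_new_tokens // 2,
        )
    return cfg
```

Because the dataclass is frozen, `for_low_memory` leaves the default configuration untouched and returns a modified copy.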

Run Guide (fp16 model)

Build.
- Build fp16 (note: --remove_input_padding and --enable_context_fmha are optional and can save some GPU memory).
```
python3 build.py --remove_input_padding --enable_context_fmha
```
- Build int8 (weight only).
```
python3 build.py --use_weight_only --weight_only_precision=int8
```
- Build int4 (weight only).
```
python3 build.py --use_weight_only --weight_only_precision=int4
```
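- If the model does not fit on a single GPU and you do not want int4/int8 quantization, try tp = 2, which builds with two GPUs (note: tp currently only supports building the engine from the Huggingface format).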
```
python3 build.py --world_size 2 --tp_size 2
```
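
The build variants above differ only in a few flags. The sketch below collects them in one place; the `build_args` helper is hypothetical, and it only assembles flags that appear in the commands above. Whether build.py accepts every combination is not verified here.

```python
def build_args(weight_only=None, tp_size=1, save_memory=False):
    """Assemble a build.py command line from the options shown above.

    weight_only: None for fp16, or "int8" / "int4" for weight-only quantization.
    tp_size: number of GPUs for tensor parallelism.
    save_memory: add the two optional memory-saving flags.
    """
    if weight_only not in (None, "int8", "int4"):
        raise ValueError(f"unsupported weight_only precision: {weight_only!r}")
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    args = ["python3", "build.py"]
    if weight_only is not None:
        args += ["--use_weight_only", f"--weight_only_precision={weight_only}"]
    if save_memory:
        args += ["--remove_input_padding", "--enable_context_fmha"]
    if tp_size > 1:
        # The commands above set world_size and tp_size to the same value.
        args += ["--world_size", str(tp_size), "--tp_size", str(tp_size)]
    return args
```

For example, `build_args(weight_only="int8")` reproduces the int8 weight-only command, and `build_args(tp_size=2)` reproduces the two-GPU command.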
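Run. After building, do a test run. The following output indicates success: Output: "您好，我是来自达摩院的大规模语言模型，我叫通义千问。" (in English: "Hello, I am a large-scale language model from DAMO Academy, and my name is Tongyi Qianwen.")
- With tp = 1 (the default, single GPU), run run.py directly with python.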
```
python3 run.py
```
- With tp = 2 (two GPUs, or more), run run.py with mpirun.
```
mpirun -n 2 --allow-run-as-root python run.py
```
- Multi-GPU runs in the official 24.02 container may fail with: Failed, NCCL error /home/jenkins/agent/workspace/LLM/release-0.8/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'. Installing nccl 2.20.3-1 resolves this; either unpack the tarball and add it to the library path, or install it with apt.
```
export LD_LIBRARY_PATH=nccl_2.20.3-1+cuda12.3_x86_64/lib/:$LD_LIBRARY_PATH
# Or, recommended: install with apt
apt update && apt-get install -y --no-install-recommends libnccl2=2.20.3-1+cuda12.3 libnccl-dev=2.20.3-1+cuda12.3
```
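
The choice between the two launch commands above depends only on the tensor-parallel size. A minimal sketch, with a hypothetical helper name, that returns the matching command:

```python
def run_command(tp_size=1, script="run.py"):
    """Return the launch command for the given tensor-parallel size."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    if tp_size == 1:
        # Single GPU: run the script directly.
        return ["python3", script]
    # Multiple GPUs: one MPI rank per GPU, as in the mpirun command above.
    return ["mpirun", "-n", str(tp_size), "--allow-run-as-root", "python", script]
```

The rank count passed to mpirun should match the tp_size used when the engine was built.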
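Verify model accuracy. Run summarize.py and compare the rouge scores of huggingface and trt-llm. This step downloads a dataset online; if your network is unreliable, see: how to load HuggingFace datasets offline with datasets
- Run the huggingface version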
```
python3 summarize.py --test_hf
```
- Run the trt-llm version
```
python3 summarize.py --test_trt_llm
```
- In general, if the trt-llm rouge score is close to the huggingface score, that is, lower by at most 1 or higher by at most 2, accuracy is considered aligned.
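
The rule of thumb above can be written as a one-line check. This is a sketch with a hypothetical function name; the two tolerances are the ones stated above, and the scores are whatever rouge values summarize.py prints for each backend.

```python
def rouge_aligned(hf_score, trt_score, max_drop=1.0, max_gain=2.0):
    """True if the trt-llm score is at most max_drop below or max_gain above the HF score."""
    diff = trt_score - hf_score
    return -max_drop <= diff <= max_gain
```

For instance, a trt-llm score half a point below the huggingface score passes, while one that is 1.5 points lower does not.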
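Measure model throughput and generation speed. This requires the file ShareGPT_V3_unfiltered_cleaned_split.json.
- Download it directly with wget or a browser: download link
- Or download it from Baidu Netdisk: https://pan.baidu.com/s/12rot0Lc0hc9oCb7GxBS6Ng?pwd=jps5 (extraction code: jps5)
- After downloading, place it under examples/qwen2/ as well.
- To change max_input_length/max_new_tokens before measuring, edit default_config.py. Changing them is generally not recommended; if you do, rebuild the trt-llm engine so that both backends use the same dataset input lengths.
- Measure the huggingface model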
```
python3 benchmark.py --backend=hf --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --hf_max_batch_size=1
```
- Measure the trt-llm model (note: --trt_max_batch_size must not exceed the maximum batch_size defined at build time, otherwise memory errors occur).
```
python3 benchmark.py --backend=trt_llm --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --trt_max_batch_size=1
```
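
The batch-size constraint in the note above is easy to check before launching a long benchmark. The sketch below uses hypothetical helper names; the throughput function is a generic tokens-per-second definition and not necessarily the exact metric that benchmark.py reports.

```python
def check_trt_batch_size(trt_max_batch_size, build_max_batch_size):
    """Reject a benchmark batch size larger than the one the engine was built with."""
    if trt_max_batch_size < 1:
        raise ValueError("batch size must be at least 1")
    if trt_max_batch_size > build_max_batch_size:
        raise ValueError(
            f"--trt_max_batch_size={trt_max_batch_size} exceeds the build-time "
            f"max batch size {build_max_batch_size}"
        )
    return trt_max_batch_size


def tokens_per_second(generated_tokens, elapsed_seconds):
    """Generic throughput: generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return generated_tokens / elapsed_seconds
```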

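Run Guide (Smooth Quant) (highly recommended)

Note: Smooth Quant requires loading the huggingface model fully onto the GPU to build the int8 calibration dataset, so make sure beforehand that your GPU memory is large enough to hold the whole model.
Convert the Huggingface-format weights to the format required by FT (FasterTransformer). This step downloads a dataset online; if your network is unreliable, see: how to load HuggingFace datasets offline with datasets
- Single GPU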
```
python3 hf_qwen_convert.py --smoothquant=0.5
```
- Multiple GPUs (2 GPUs as an example)
```
python3 hf_qwen_convert.py --smoothquant=0.5 --tensor-parallelism=2
```
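
For readers wondering what the value 0.5 controls: assuming --smoothquant sets the migration strength α from the SmoothQuant method (an assumption, not confirmed by this README), each input channel j is rescaled by s_j = max|X_j|^α / max|W_j|^(1-α). Activations are divided by s_j and weights multiplied by it, which moves quantization difficulty from activations to weights. The following pure-Python sketch illustrates only this formula; it is not the code used by hf_qwen_convert.py.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-channel scales s_j = act_absmax_j**alpha / weight_absmax_j**(1 - alpha).

    act_absmax / weight_absmax: per-channel maximum absolute values
    (hypothetical inputs that would come from calibration).
    """
    if len(act_absmax) != len(weight_absmax):
        raise ValueError("both inputs need one value per channel")
    if any(v <= 0 for v in list(act_absmax) + list(weight_absmax)):
        raise ValueError("absolute maxima must be positive")
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


if __name__ == "__main__":
    acts, weights = [16.0, 4.0], [1.0, 4.0]
    scales = smoothing_scales(acts, weights)
    # With alpha = 0.5 the smoothed activation and weight ranges become equal.
    print([a / s for a, s in zip(acts, scales)])
    print([w * s for w, s in zip(weights, scales)])
```

With α = 0.5 the two printed lists match, which shows that the activation and weight ranges are balanced per channel.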

Build the trt_engine

Single GPU

python3 build.py --use_smooth_quant --per_token --per_channel

Multiple GPUs (2 GPUs as an example)

python3 build.py --use_smooth_quant --per_token --per_channel --world_size 2 --tp_size 2

After the build completes, run/summarize/benchmark work the same as above.

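Run Guide (int8-kv-cache)

Note: int8-kv-cache requires loading the huggingface model fully onto the GPU to build the int8 calibration dataset, so make sure beforehand that your GPU memory is large enough to hold the whole model.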
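
As a rough way to judge whether a model can be loaded fully, the weight footprint is approximately the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope estimate only: the helper names and the 0.8 headroom factor are assumptions, the nominal parameter counts ("7B" as 7e9) are approximations, and calibration needs additional memory for activations.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate weight size in GiB (2 bytes per parameter for fp16/bf16)."""
    return num_params * bytes_per_param / 1024 ** 3


def fits_on_gpu(num_params, gpu_mem_gib, bytes_per_param=2, headroom=0.8):
    """True if the weights fit within a fraction (headroom) of the GPU memory.

    headroom is a hypothetical safety margin for activations and buffers.
    """
    return weight_memory_gib(num_params, bytes_per_param) <= gpu_mem_gib * headroom
```

Under these assumptions a nominal 7B model needs about 13 GiB for fp16 weights, so it fits on a 24 GiB GPU, while a nominal 14B model does not.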

Convert the Huggingface-format weights to the format required by FT (FasterTransformer).

Single GPU

python3 hf_qwen_convert.py --calibrate-kv-cache

Multiple GPUs (2 GPUs as an example)

python3 hf_qwen_convert.py --calibrate-kv-cache --tensor-parallelism=2

Build int8 weight only + int8-kv-cache

Single GPU

python3 build.py --use_weight_only --weight_only_precision=int8 --int8_kv_cache

Multiple GPUs (2 GPUs as an example)

python3 build.py --use_weight_only --weight_only_precision=int8 --int8_kv_cache --world_size 2 --tp_size 2

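Run Guide (int4-gptq)

Install the auto-gptq module and upgrade transformers to the latest version (the latest optimum and transformers are both recommended; otherwise the output may be garbled), see issue/68. (Note: after installation you may see a warning that tensorrt_llm is incompatible with other module versions; this warning can be ignored.)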
```
pip install auto-gptq optimum
pip install transformers -U
```

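Obtain calibration weights manually (optional)

Convert the weights to obtain the scale information. Calibration uses the GPU by default and must be able to load the full model. (Note: for Qwen-7B-Chat V1.0, you can add --device=cpu to try calibrating on the CPU, but this takes a long time.)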
```
python3 gptq_convert.py
```

Build the TensorRT-LLM Engine

python build.py --use_weight_only \
          --weight_only_precision int4_gptq \
          --per_group

To save GPU memory (note: single batch only), try adding these two parameters when building the engine

python build.py --use_weight_only \
          --weight_only_precision int4_gptq \
          --per_group \
          --remove_input_padding \
          --enable_context_fmha

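Use official int4 weights, for example the Qwen-xx-Chat-Int4 models (recommended)

Build the model. Set the hf model path (--hf_model_dir) and the quantized weight path (--quant_ckpt_path) to the same path. Below is an example for the 32b-gptq-int4 model (other gptq-int4 models work the same way).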

python build.py --use_weight_only \
          --weight_only_precision int4_gptq \
          --per_group \
          --hf_model_dir Qwen1.5-32B-Chat-GPTQ-Int4 \
          --quant_ckpt_path Qwen1.5-32B-Chat-GPTQ-Int4

Run the model. The tokenizer path must be specified here.

python3 run.py --tokenizer_dir=Qwen1.5-32B-Chat-GPTQ-Int4

Run Guide (int4-awq)

Download and install the nvidia-ammo module (Linux only; Windows is not supported)

pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo~=0.7.0

Run the int4-awq quantization code to export the calibrated weights.

python3 quantize.py --export_path ./qwen2_7b_4bit_gs128_awq.pt

Run build.py to build the TensorRT-LLM Engine.

python build.py --use_weight_only \
                --weight_only_precision int4_awq \
                --per_group \
                --quant_ckpt_path ./qwen2_7b_4bit_gs128_awq.pt

To save GPU memory (note: single batch only), try adding these two parameters when building the engine

python build.py --use_weight_only \
                --weight_only_precision int4_awq \
                --per_group \
                --remove_input_padding \
                --enable_context_fmha \
                --quant_ckpt_path ./qwen2_7b_4bit_gs128_awq.pt

Advanced

Deploy tritonserver by following this tutorial: Deploying TensorRT-LLM with Triton 24.02 and querying over HTTP
Wrap tritonserver with this project to support the OpenAI API format: https://github.com/zhaohb/fastapi_tritonserver
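
Once tritonserver is running, an HTTP query might look like the sketch below. It relies on Triton's HTTP generate endpoint (/v2/models/<model>/generate) on port 8000 as published earlier. The model name `ensemble` and the `text_input` / `max_tokens` field names are assumptions based on a common TensorRT-LLM backend layout; the names in your deployment depend on its model repository, so adjust them as needed. The helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, max_tokens=64, host="localhost", port=8000, model="ensemble"):
    """Build (but do not send) a POST request for Triton's generate endpoint.

    The model name and payload field names are assumptions; see the lead-in.
    """
    url = f"http://{host}:{port}/v2/models/{model}/generate"
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response (needs a running server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to inspect before pointing it at a live server.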
