[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MIT License
Efficient and accurate low-bit weight quantization (INT3/4) for LLMs, supporting instruction-tuned models and multi-modal LMs.
The current release supports:
Thanks to AWQ, TinyChat can deliver more efficient responses with LLM/VLM chatbots through 4-bit inference.
TinyChat also supports inference with vision language models (e.g., VILA, LLaVA). In the following examples, W4A16 quantized models from VILA family are launched with TinyChat.
Check out TinyChat, which offers a turn-key solution for on-device inference of LLMs and VLMs on resource-constrained edge platforms. With TinyChat, it is now possible to efficiently run large models on small and low-power devices even without Internet connection!
from_pretrained
. You can either load quantized models from the Hub or your own HF quantized models.git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq
conda create -n awq python=3.10 -y
conda activate awq
pip install --upgrade pip # enable PEP 660 support
pip install -e .
For edge devices like Orin, before running the commands above, please:
conda create -n awq python=3.8 -y
for JetPack 5).cd awq/kernels
python setup.py install
git clone [email protected]:Efficient-Large-Model/VILA.git
cd VILA
pip install -e .
We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To get the pre-computed AWQ search results, run:
# git lfs install # install git lfs if not already
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
The detailed support list:
Models | Sizes | INT4-g128 | INT3-g128 |
---|---|---|---|
VILA-1.5 | 3B/8B/13B/40B | ||
Llama3 | 8B/70B | ||
VILA | 7B/13B | ||
Llama2 | 7B/13B/70B | ||
LLaMA | 7B/13B/30B/65B | ||
OPT | 125m/1.3B/2.7B/6.7B/13B/30B | ||
CodeLlama | 7B/13B/34B | ||
StarCoder | 15.5B | ||
Vicuna-v1.1 | 7B/13B | ||
LLaVA-v0 | 13B |
Note: We only list models that we have prepare the AWQ searching results in the table above. AWQ also supports models such as LLaVA-v1.5 7B, and you may need to run the AWQ search on your own to quantize these models.
AWQ can be easily applied to various LMs thanks to its good generalization, including instruction-tuned models and multi-modal LMs. It provides an easy-to-use tool to reduce the serving cost of LLMs.
Here we provide two examples of AWQ application: Vicuna-7B (chatbot) and LLaVA-13B (visual reasoning) under ./examples
directory. AWQ can easily reduce the GPU memory of model serving and speed up token generation. It provides accurate quantization, providing reasoning outputs. You should be able to observe memory savings when running the models with 4-bit weights.
Note that we perform AWQ using only textual calibration data, depsite we are running on multi-modal input. Please refer to ./examples
for details.
We provide several sample script to run AWQ (please refer to ./scripts
). We use Llama3-8B as an example.
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
--w_bit 4 --q_group_size 128 \
--run_awq --dump_awq awq_cache/llama3-8b-w4-g128.pt
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_awq awq_cache/llama3-8b-w4-g128.pt \
--q_backend fake
mkdir quant_cache
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
--w_bit 4 --q_group_size 128 \
--load_awq awq_cache/llama3-8b-w4-g128.pt \
--q_backend real --dump_quant quant_cache/llama3-8b-w4-g128-awq.pt
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_quant quant_cache/llama3-8b-w4-g128-awq.pt
AWQ also seamlessly supports large multi-modal models (LMMs). We demonstrate the results on the recent VILA-1.5 model family.
VILA-1.5-3B | VQA-v2 | GQA | VizWiz | ScienceQA | TextVQA | POPE | MME | MMBench | MMBench-CN | SEED |
---|---|---|---|---|---|---|---|---|---|---|
FP16 | 80.4 | 61.5 | 53.5 | 69.0 | 60.4 | 85.9 | 1442.4 | 63.4 | 52.7 | 60.9 |
AWQ-INT4 | 80.0 | 61.1 | 53.8 | 67.8 | 60.4 | 85.9 | 1437.3 | 63.3 | 51.4 | 59.8 |
VILA-1.5-8B | VQA-v2 | GQA | VizWiz | ScienceQA | TextVQA | POPE | MME | MMBench | MMBench-CN | SEED |
---|---|---|---|---|---|---|---|---|---|---|
FP16 | 80.9 | 61.9 | 58.7 | 79.9 | 66.3 | 84.4 | 1577.01 | 72.3 | 66.2 | 64.2 |
AWQ-INT4 | 80.3 | 61.7 | 59.3 | 79.0 | 65.4 | 82.9 | 1593.65 | 71.0 | 64.9 | 64.0 |
VILA-1.5-13B | VQA-v2 | GQA | VizWiz | ScienceQA | TextVQA | POPE | MME | MMBench | MMBench-CN | SEED |
---|---|---|---|---|---|---|---|---|---|---|
FP16 | 82.8 | 64.3 | 62.6 | 80.1 | 65.0 | 86.3 | 1569.55 | 74.9 | 66.3 | 65.1 |
AWQ-INT4 | 82.7 | 64.5 | 63.3 | 79.7 | 64.7 | 86.7 | 1531.35 | 74.7 | 66.7 | 65.1 |
VILA-1.5-40B | VQA-v2 | GQA | VizWiz | ScienceQA | TextVQA | POPE | MME | MMBench | MMBench-CN | SEED |
---|---|---|---|---|---|---|---|---|---|---|
FP16 | 84.3 | 64.6 | 62.2 | 87.2 | 73.6 | 87.3 | 1726.82 | 82.4 | 80.2 | 69.1 |
AWQ-INT4 | 84.1 | 64.4 | 61.3 | 86.7 | 73.2 | 88.2 | 1714.79 | 83.2 | 79.6 | 68.9 |
$~~~~~~$ | Precision | A100 | 4090 | Orin |
---|---|---|---|---|
VILA1.5-3B | fp16 | 104.6 | 137.6 | 25.4 |
VILA1.5-3B-AWQ | int4 | 182.8 | 215.5 | 42.5 |
VILA1.5-3B-S2 | fp16 | 104.3 | 137.2 | 24.6 |
VILA1.5-3B-S2-AWQ | int4 | 180.2 | 219.3 | 40.1 |
Llama-3-VILA1.5-8B | fp16 | 74.9 | 57.4 | 10.2 |
Llama-3-VILA1.5-8B-AWQ | int4 | 168.9 | 150.2 | 28.7 |
VILA1.5-13B | fp16 | 50.9 | OOM | 6.1 |
VILA1.5-13B-AWQ | int4 | 115.9 | 105.7 | 20.6 |
VILA1.5-40B | fp16 | OOM | OOM | -- |
VILA1.5-40B-AWQ | int4 | 57.0 | OOM | -- |
If you find AWQ useful or relevant to your research, please kindly cite our paper:
@inproceedings{lin2023awq,
title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
booktitle={MLSys},
year={2024}
}
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers