An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
MIT License
Sharded checkpoints can now be loaded in the from_quantized method.
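For illustration, a minimal sketch of loading a sharded checkpoint (the repo id is a placeholder; a GPTQ checkpoint saved as multiple safetensors shards should load with the same call as a single-file one):

import torch
from auto_gptq import AutoGPTQForCausalLM

# Placeholder repo id: a checkpoint split into several
# model-XXXXX-of-XXXXX.safetensors shards loads the same way
# as a single-file checkpoint.
model = AutoGPTQForCausalLM.from_quantized(
    "org/sharded-gptq-model",
    torch_dtype=torch.float16,
    device="cuda:0",
)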
The Gemma model can be quantized with AutoGPTQ.
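As a hedged sketch of what that looks like, following the usual AutoGPTQ quantization flow (the Gemma checkpoint name and the one-sentence calibration set are placeholder assumptions):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "google/gemma-2b"  # placeholder Gemma checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A single calibration example for brevity; real calibration sets are larger.
examples = [
    tokenizer("AutoGPTQ is an easy-to-use quantization package based on the GPTQ algorithm.")
]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized("gemma-2b-gptq")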
Full Changelog: https://github.com/AutoGPTQ/AutoGPTQ/compare/v0.7.0...v0.7.1
Published by fxmarty 8 months ago
@efrantar, GPTQ author, released Marlin, an optimized CUDA kernel for Ampere GPUs for int4*fp16 matrix multiplication, with per-group symmetric quantization support (without act-order), which significantly outperforms other existing kernels when using batching.
This kernel can be used in AutoGPTQ when loading models with the use_marlin=True argument. Using this flag repacks the quantized weights, as the Marlin kernel expects a different layout. The repacked weights are then saved locally so that repacking is not needed on later loads. Example:
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
# use_marlin=True triggers a one-time repack of the GPTQ weights into the
# layout the Marlin kernel expects; the repacked weights are cached locally.
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")
prompt = "Is quantization a good compression technique?"
inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")
res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))
# Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]
#
# <s> Is quantization a good compression technique?
#
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.
A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark
Visual tables coming soon.
Note: the AWQ checkpoint repacking step is currently slow; a faster implementation is possible.
AWQ's original implementation adopted a serialization format different from the one expected by current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the computation happens to be the same. We allow loading AWQ checkpoints in AutoGPTQ to leverage the exllama/exllamav2 kernels, which may be more performant for some problem sizes (see the PR below, notably at sequence_length = 1 and for long sequences).
Example:
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
# The AWQ checkpoint is repacked on load into the layout the GPTQ kernels expect.
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")
prompt = "Is quantization a good compression technique?"
inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")
res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))
# Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00, 1.18s/it]
#
# <s> Is quantization a good compression technique?
#
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.
These models can be quantized with AutoGPTQ.
Full Changelog: https://github.com/AutoGPTQ/AutoGPTQ/compare/v0.6.0...v0.7.0
Published by fxmarty 10 months ago
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.5.1...v0.6.0
Published by fxmarty 11 months ago
Mainly fixes Windows support.
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.5.0...v0.5.1
Published by fxmarty 12 months ago
The more performant GPTQ kernels from @turboderp's exllamav2 library are now available directly in AutoGPTQ, and are the default backend choice.
A comprehensive benchmark is available here.
This is experimental.
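A minimal sketch of what this means in practice (the checkpoint id is illustrative, and the opt-out kwarg mentioned in the comments is an assumption, not a verified name):

from auto_gptq import AutoGPTQForCausalLM

# With v0.5.0 the exllamav2 kernels are picked by default for compatible
# 4-bit GPTQ checkpoints, so no extra argument is needed:
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    device="cuda:0",
)
# An opt-out kwarg (disable_exllamav2-style) is assumed to exist; check
# from_quantized's signature for the exact name in your version.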
When loading a model, use_safetensors defaults to True by @fxmarty in https://github.com/PanQiWei/AutoGPTQ/pull/383
Support adapter_name for get_gptq_peft_model with train_mode=True by @alex4321 in https://github.com/PanQiWei/AutoGPTQ/pull/347
Improvements to pack_model by @PanQiWei in https://github.com/PanQiWei/AutoGPTQ/pull/355
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.4.2...v0.5.0
Published by fxmarty about 1 year ago
This patch release includes a major bugfix so that the exllama backend works with input lengths > 2048, through a reconfigurable buffer size:
from auto_gptq import exllama_set_max_input_length
...
# Resize the exllama buffers to accept inputs of up to 4096 tokens.
model = exllama_set_max_input_length(model, 4096)
This patch tentatively includes the exllama kernels in the wheels for Windows.
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.4.1...v0.4.2
Published by PanQiWei about 1 year ago
Not only pytorch==2.0.0 but also pytorch>=2.0.0 can now be used for llama fused attention.
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.4.0...v0.4.1
Published by PanQiWei about 1 year ago
Support static_groups=True on quantization, which can further improve the quantized model's performance and close the PPL gap against the un-quantized model.
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.3.2...v0.4.0
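A minimal sketch of enabling it, mirroring the usual BaseQuantizeConfig usage (the base checkpoint is a placeholder, and treating static_groups as a BaseQuantizeConfig field follows from this release's description):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    static_groups=True,  # compute per-group quantization parameters statically
)
model = AutoGPTQForCausalLM.from_pretrained("facebook/opt-125m", quantize_config)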
Published by PanQiWei about 1 year ago
Fix the bug that desc_act and group_size can't be used together.
Add perplexity_utils.py to gracefully calculate PPL, so that the result can be used to compare with other libraries fairly.
Remove the save_dir argument from the from_quantized method; now only the model_name_or_path argument is supported.
Support revision and other huggingface_hub kwargs in .from_quantized() by @TheBloke in https://github.com/PanQiWei/AutoGPTQ/pull/205 (see the example below)
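For example, loading a specific branch of a quantized repo (the branch name is illustrative of TheBloke's revision naming, not verified):

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # a specific branch of the repo
    device="cuda:0",
)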
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.3.0...v0.3.2
Published by PanQiWei over 1 year ago
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.2.1...v0.3.0
Published by PanQiWei over 1 year ago
Fix the autogptq_cuda dir missing from the distribution file.
Published by PanQiWei over 1 year ago
Fix the problem that installation from PyPI failed when the environment variable CUDA_VERSION is set.
Published by PanQiWei over 1 year ago
Happy International Children's Day! 🎈 In the age of LLMs and at the dawn of AGI, may we always be as curious as children, with vigorous energy and the courage to explore the bright future.
A bunch of new features have been added in this version:
Fused attention for llama and gptj, and fused MLP for llama.
New model support: codegen, gpt_bigcode, and falcon.
Below is the detailed change log:
Support device_map by @PanQiWei in https://github.com/PanQiWei/AutoGPTQ/pull/80 (see the sketch after this list)
Support the model(tokens) syntax by @TheBloke in https://github.com/PanQiWei/AutoGPTQ/pull/84
Support push_to_hub by @TheBloke in https://github.com/PanQiWei/AutoGPTQ/pull/91
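A hedged sketch combining two of these conveniences, device_map dispatch and the direct model(tokens) call (the repo id is illustrative and postdates this release; exact outputs depend on the model):

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    device_map="auto",  # dispatch layers across the available devices
)

# Assumes the input embeddings landed on cuda:0 under device_map="auto".
inp = tokenizer("Hello", return_tensors="pt").to("cuda:0")
out = model(**inp)  # transformers-style direct forward call
print(out.logits.shape)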
The following are new contributors and their first PRs. Thank you very much for your love of auto_gptq and your contributions! ❤️
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.1.0...v0.2.0
Published by PanQiWei over 1 year ago
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.0.5...v0.1.0
Published by PanQiWei over 1 year ago
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.0.4...v0.0.5
Published by PanQiWei over 1 year ago
triton is officially supported starting from this version!
pip install auto-gptq is supported starting from this version!
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.0.3...v0.0.4
Published by PanQiWei over 1 year ago
Published by PanQiWei over 1 year ago
Add the eval_tasks module to support evaluating a model's performance on predefined downstream tasks before and after quantization.
Fix a bug in the LLaMa model related to position_ids.