AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

MIT License


AutoGPTQ - v0.7.1: patch release

Published by fxmarty 8 months ago

Support loading sharded quantized checkpoints

Sharded checkpoints can now be loaded in the from_quantized method.
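
A minimal sketch, assuming a GPTQ checkpoint whose weights are split across multiple safetensors shards (the repository ID below is hypothetical):

from auto_gptq import AutoGPTQForCausalLM

# Shards are discovered and loaded transparently by from_quantized.
model = AutoGPTQForCausalLM.from_quantized(
    "some-org/Llama-2-70B-chat-GPTQ-sharded",  # hypothetical sharded checkpoint
    device="cuda:0",
)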

Gemma GPTQ quantization

Gemma model can be quantized with AutoGPTQ.
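
A minimal quantization sketch following the usual AutoGPTQ flow (the Gemma checkpoint ID is illustrative, and a real calibration set should contain many more examples):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "google/gemma-2b"  # illustrative Gemma checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
# A real calibration set should contain many more (and more diverse) samples.
examples = [tokenizer("AutoGPTQ is an easy-to-use LLM quantization package.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized("gemma-2b-gptq-4bit")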

Other changes and fixes

Full Changelog: https://github.com/AutoGPTQ/AutoGPTQ/compare/v0.7.0...v0.7.1

AutoGPTQ - v0.7.0: Marlin int4*fp16 kernel, AWQ checkpoints loading

Published by fxmarty 8 months ago

Marlin efficient int4*fp16 kernel on Ampere GPUs, AWQ checkpoints loading

@efrantar, GPTQ author, released Marlin, an optimized CUDA kernel for Ampere GPUs for int4*fp16 matrix multiplication, with per-group symmetric quantization support (without act-order), which significantly outperforms other existing kernels when using batching.

This kernel can be used in AutoGPTQ when loading models with the use_marlin=True argument. Using this flag repacks the quantized weights, as the Marlin kernel expects a different layout. The repacked weights are then saved locally to avoid the need to repack again. Example:

import torch

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))

# Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]
# 
# <s> Is quantization a good compression technique?
# 
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.

A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark

Visual tables coming soon.

Ability to load AWQ checkpoints in AutoGPTQ

Note: the AWQ checkpoint repacking step is currently slow; a faster implementation could be added.

AWQ's original implementation adopted a serialization format different from the one expected by the current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the underlying computation happens to be the same. AutoGPTQ can now load AWQ checkpoints to leverage the exllama/exllamav2 kernels, which may be more performant for some problem sizes (see the PR below, notably at sequence_length = 1 and for long sequences).

Example:

import torch

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))

# Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00,  1.18s/it]
# 
# <s> Is quantization a good compression technique?
# 
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.

Qwen2, LongLLaMA, Deci_lm models support

These models can be quantized with AutoGPTQ.

Other changes and bugfixes

New Contributors

Full Changelog: https://github.com/AutoGPTQ/AutoGPTQ/compare/v0.6.0...v0.7.0

AutoGPTQ - v0.6.0

What's Changed

New Contributors

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.5.1...v0.6.0

AutoGPTQ - v0.5.1: Patch release

Published by fxmarty 11 months ago

Mainly fixes Windows support.

What's Changed

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.5.0...v0.5.1

AutoGPTQ - v0.5.0: Exllama v2 GPTQ kernels, RoCm 5.6/5.7 support, many bugfixes

Published by fxmarty 12 months ago

Exllama v2 GPTQ kernel support

The more performant GPTQ kernels from @turboderp's exllamav2 library are now available directly in AutoGPTQ, and are the default backend choice.
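
A minimal sketch of loading a quantized model with the default backend; starting with v0.5.0, the exllamav2 kernels are picked automatically when available (the checkpoint ID is illustrative):

from auto_gptq import AutoGPTQForCausalLM

# No extra flags needed: the exllamav2 kernels are the default backend choice.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",  # illustrative GPTQ checkpoint
    device="cuda:0",
)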

A comprehensive benchmark is available here.

CPU inference support

This is experimental.

Loading from safetensors is now the default

Falcon, Mistral support

Other changes and bugfixes

New Contributors

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.4.2...v0.5.0

AutoGPTQ - v0.4.2: Patch release

Published by fxmarty about 1 year ago

Major bugfix: exllama backend with arbitrary input length

This patch release includes a major bugfix that allows the exllama backend to work with input lengths > 2048, through a reconfigurable buffer size:

from auto_gptq import exllama_set_max_input_length

...  # model: a GPTQ model already loaded with the exllama backend
# Raise the exllama buffer size to allow inputs of up to 4096 tokens.
model = exllama_set_max_input_length(model, 4096)

Exllama kernels support in Windows wheels

This patch tentatively includes the exllama kernels in the wheels for Windows.

What's Changed

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.4.1...v0.4.2

AutoGPTQ - v0.4.1: Patch Fix

Published by PanQiWei about 1 year ago

Overview

  • Fix a typo so that llama fused attention can be used with pytorch>=2.0.0, not only pytorch==2.0.0.
  • Patch the exllama QuantLinear to avoid modifying the state dict, making the integration with transformers smoother.

Change Log

What's Changed

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.4.0...v0.4.1

AutoGPTQ - v0.4.0

Published by PanQiWei about 1 year ago

Overview

  • New platform: support for the ROCm platform (5.4.2 for now, to be extended to 5.5 and 5.6 as soon as pytorch officially releases 2.1.0).
  • New kernels: support for the exllama q4 kernels, giving at least a 1.3x inference speedup.
  • New quantization strategy: support specifying static_groups=True at quantization time, which can further improve the quantized model's performance and close the PPL gap against the un-quantized model (see the sketch after this list).
  • New model: qwen
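
A minimal sketch of enabling the new strategy, assuming static_groups is exposed as a BaseQuantizeConfig field (the remaining values are illustrative):

from auto_gptq import BaseQuantizeConfig

# Per the release note, static_groups=True can further improve quality
# and close the PPL gap against the un-quantized model.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    static_groups=True,  # assumed field name, matching the release note's flag
)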

Full Change Log

What's Changed

New Contributors

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.3.2...v0.4.0

AutoGPTQ - v0.3.2: Patch Fix

Published by PanQiWei about 1 year ago

Overview

  • Fix a CUDA kernel bug that prevented desc_act and group_size from being used together (see the sketch after this list)
  • Improve the user experience of manual installation
  • Improve the user experience of loading quantized models
  • Add perplexity_utils.py to compute PPL gracefully, so that results can be compared fairly with other libraries
  • Remove the save_dir argument from from_quantized; only the model_name_or_path argument is now supported in this method
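
With this fix, act-order and grouping can be combined in a single config. A minimal sketch using the standard BaseQuantizeConfig fields (the values are illustrative):

from auto_gptq import BaseQuantizeConfig

# desc_act (act-order) and group_size can now be used together.
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # per-group quantization
    desc_act=True,   # activation-order quantization
)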

Full Change Log

What's Changed

New Contributors

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.3.0...v0.3.2

AutoGPTQ - v0.3.0

Published by PanQiWei over 1 year ago

Overview

  • CUDA kernel improvements: support models whose hidden_size is only divisible by 32/64 instead of 256.
  • Peft integration: support training and inference using LoRA, AdaLoRA, AdaptionPrompt, etc. (see the sketch after this list).
  • New models: BaiChuan, InternLM.
  • Other updates: see 'Full Change Log' below for details.
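
A rough sketch of LoRA fine-tuning on a quantized model. The get_gptq_peft_model / GPTQLoraConfig helpers and the trainable flag below follow the repository's peft examples, so treat them as assumptions rather than a guaranteed API (the checkpoint path is illustrative):

from auto_gptq import AutoGPTQForCausalLM
# Assumed import path, based on the repository's peft examples.
from auto_gptq.utils.peft_utils import GPTQLoraConfig, get_gptq_peft_model

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/quantized-model",  # illustrative local path or Hub repo ID
    device="cuda:0",
    trainable=True,  # assumed flag selecting training-compatible kernels
)

peft_config = GPTQLoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_gptq_peft_model(model, peft_config, train_mode=True)
# ...then train with the usual peft/transformers training loop.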

Full Change Log

What's Changed

New Contributors

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.2.1...v0.3.0

AutoGPTQ - v0.2.2: Patch Release

Published by PanQiWei over 1 year ago

  • fix the autogptq_cuda directory missing from the distribution file

AutoGPTQ - v0.2.1: Patch Release

Published by PanQiWei over 1 year ago

Fix the problem that installation from pypi failed when the environment variable CUDA_VERSION is set.

AutoGPTQ - v0.2.0

Published by PanQiWei over 1 year ago

Happy International Children's Day! 🎈 At the age of LLMs and the dawn of AGI, may we always be curious like children, with vigorous energy and courage to explore the bright future.

Features Summary

A bunch of new features have been added in this version:

  • Optimized modules for faster inference speed: fused attention for llama and gptj, fused mlp for llama (see the sketch after this list)
  • Full CPU offloading
  • Multiple GPUs inference with the triton backend
  • Three new models are supported: codegen, gpt_bigcode and falcon
  • Support downloading/uploading quantized models from/to the HF Hub
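
A minimal sketch of enabling the optimized modules at load time; inject_fused_attention, inject_fused_mlp and use_triton are the from_quantized flags this sketch assumes, and the checkpoint path is illustrative:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/llama-quantized",    # illustrative local path or Hub repo ID
    device="cuda:0",
    use_triton=True,              # triton backend (also used for multi-GPU inference)
    inject_fused_attention=True,  # fused attention for llama/gptj (assumed flag name)
    inject_fused_mlp=True,        # fused MLP for llama (assumed flag name)
)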

Change Log

Below is the detailed change log:

New Contributors

Following are new contributors and their first PRs. Thank you very much for your love of auto_gptq and your contributions! ❤️

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.1.0...v0.2.0

AutoGPTQ - v0.1.0

Published by PanQiWei over 1 year ago

What's Changed

New Contributors

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.0.5...v0.1.0

AutoGPTQ - v0.0.5

Published by PanQiWei over 1 year ago

What's Changed

New Contributors

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.0.4...v0.0.5

AutoGPTQ - v0.0.4

Published by PanQiWei over 1 year ago

Big News

  • triton is officially supported starting from this version!
  • quick install from pypi using pip install auto-gptq is supported starting from this version!

What's Changed

Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.0.3...v0.0.4

AutoGPTQ - v0.0.3

Published by PanQiWei over 1 year ago

What's Changed

  • fix typo in README.md
  • fix a problem where some models' max sequence length could not be retrieved
  • fix a problem where some models' transformer layers require additional positional arguments in their forward pass
  • fix a mismatch in GPTNeoxForCausalLM's lm_head

New Contributors

AutoGPTQ - v0.0.2

Published by PanQiWei over 1 year ago

  • added the eval_tasks module to support evaluating a model's performance on predefined downstream tasks before and after quantization
  • fixed some bugs when using the LLaMa model
  • fixed some bugs when using models that require position_ids