An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
MIT License
Sharded checkpoints can now be loaded in the from_quantized method.
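For illustration, a minimal sketch of loading a sharded checkpoint (the repo id is a placeholder; a GPTQ checkpoint saved as multiple safetensors shards should load with the same call as a single-file one):

import torch
from auto_gptq import AutoGPTQForCausalLM

# Placeholder repo id: a checkpoint split into several
# model-XXXXX-of-XXXXX.safetensors shards loads the same way
# as a single-file checkpoint.
model = AutoGPTQForCausalLM.from_quantized(
    "org/sharded-gptq-model",
    torch_dtype=torch.float16,
    device="cuda:0",
)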
The Gemma model can be quantized with AutoGPTQ.
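As a hedged sketch of what that looks like, following the usual AutoGPTQ quantization flow (the Gemma checkpoint name and the one-sentence calibration set are placeholder assumptions):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "google/gemma-2b"  # placeholder Gemma checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A single calibration example for brevity; real calibration sets are larger.
examples = [
    tokenizer("AutoGPTQ is an easy-to-use quantization package based on the GPTQ algorithm.")
]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized("gemma-2b-gptq")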
Full Changelog: https://github.com/AutoGPTQ/AutoGPTQ/compare/v0.7.0...v0.7.1
Published by fxmarty 8 months ago
@efrantar, GPTQ author, released Marlin, an optimized CUDA kernel for Ampere GPUs for int4*fp16 matrix multiplication, with per-group symmetric quantization support (without act-order), which significantly outperforms other existing kernels when using batching.
This kernel can be used in AutoGPTQ when loading models with the use_marlin=True argument. Using this flag repacks the quantized weights, as the Marlin kernel expects a different layout. The repacked weights are then saved locally so that repacking is not needed on later loads. Example:
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
# use_marlin=True triggers a one-time repack of the GPTQ weights into the
# layout the Marlin kernel expects; the repacked weights are cached locally.
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")
prompt = "Is quantization a good compression technique?"
inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")
res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))
# Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]
#
# <s> Is quantization a good compression technique?
#
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.
A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark
Visual tables coming soon.
Note: the AWQ checkpoint repacking step is currently slow; a faster implementation is possible.
AWQ's original implementation adopted a serialization format different from the one expected by current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the computation happens to be the same. We allow loading AWQ checkpoints in AutoGPTQ to leverage the exllama/exllamav2 kernels, which may be more performant for some problem sizes (see the PR below, notably at sequence_length = 1 and for long sequences).
Example:
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
# The AWQ checkpoint is repacked on load into the layout the GPTQ kernels expect.
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")
prompt = "Is quantization a good compression technique?"
inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")
res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))
# Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00, 1.18s/it]
#
# <s> Is quantization a good compression technique?
#
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.
These models can be quantized with AutoGPTQ.
Full Changelog: https://github.com/AutoGPTQ/AutoGPTQ/compare/v0.6.0...v0.7.0
Published by fxmarty 10 months ago
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.5.1...v0.6.0
Published by fxmarty 11 months ago
Mainly fixes Windows support.
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.5.0...v0.5.1
Published by fxmarty 12 months ago
The more performant GPTQ kernels from @turboderp's exllamav2 library are now available directly in AutoGPTQ, and are the default backend choice.
A comprehensive benchmark is available here.
This is experimental.
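A minimal sketch of what this means in practice (the checkpoint id is illustrative, and the opt-out kwarg mentioned in the comments is an assumption, not a verified name):

from auto_gptq import AutoGPTQForCausalLM

# With v0.5.0 the exllamav2 kernels are picked by default for compatible
# 4-bit GPTQ checkpoints, so no extra argument is needed:
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    device="cuda:0",
)
# An opt-out kwarg (disable_exllamav2-style) is assumed to exist; check
# from_quantized's signature for the exact name in your version.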
When loading a model, use_safetensors defaults to True by @fxmarty in https://github.com/PanQiWei/AutoGPTQ/pull/383
Support adapter_name for get_gptq_peft_model with train_mode=True by @alex4321 in https://github.com/PanQiWei/AutoGPTQ/pull/347
Improvements to pack_model by @PanQiWei in https://github.com/PanQiWei/AutoGPTQ/pull/355
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.4.2...v0.5.0
Published by fxmarty about 1 year ago
This patch release includes a major bugfix so that the exllama backend works with input lengths > 2048, through a reconfigurable buffer size:
from auto_gptq import exllama_set_max_input_length
...
# Resize the exllama buffers to accept inputs of up to 4096 tokens.
model = exllama_set_max_input_length(model, 4096)
This patch tentatively includes the exllama kernels in the wheels for Windows.
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.4.1...v0.4.2
Published by PanQiWei about 1 year ago
Not only pytorch==2.0.0 but also pytorch>=2.0.0 can now be used for llama fused attention.
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.4.0...v0.4.1
Published by PanQiWei about 1 year ago
Support static_groups=True on quantization, which can further improve the quantized model's performance and close the PPL gap against the un-quantized model.
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.3.2...v0.4.0
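A minimal sketch of enabling it, mirroring the usual BaseQuantizeConfig usage (the base checkpoint is a placeholder, and treating static_groups as a BaseQuantizeConfig field follows from this release's description):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    static_groups=True,  # compute per-group quantization parameters statically
)
model = AutoGPTQForCausalLM.from_pretrained("facebook/opt-125m", quantize_config)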
Published by PanQiWei about 1 year ago
Fix the bug that desc_act and group_size can't be used together.
Add perplexity_utils.py to gracefully calculate PPL, so that the result can be used to compare with other libraries fairly.
Remove the save_dir argument from the from_quantized method; now only the model_name_or_path argument is supported.
Support revision and other huggingface_hub kwargs in .from_quantized() by @TheBloke in https://github.com/PanQiWei/AutoGPTQ/pull/205 (see the example below)
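For example, loading a specific branch of a quantized repo (the branch name is illustrative of TheBloke's revision naming, not verified):

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # a specific branch of the repo
    device="cuda:0",
)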
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.3.0...v0.3.2
Published by PanQiWei over 1 year ago
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.2.1...v0.3.0
Published by PanQiWei over 1 year ago
Fix the autogptq_cuda dir missing from the distribution file.
Published by PanQiWei over 1 year ago
Fix the problem that installation from PyPI failed when the environment variable CUDA_VERSION is set.
Published by PanQiWei over 1 year ago
Happy International Children's Day! 🎈 In the age of LLMs and at the dawn of AGI, may we always be as curious as children, with vigorous energy and the courage to explore the bright future.
A bunch of new features have been added in this version:
Fused attention for llama and gptj, and fused MLP for llama.
New model support: codegen, gpt_bigcode, and falcon.
Below is the detailed change log:
Support device_map by @PanQiWei in https://github.com/PanQiWei/AutoGPTQ/pull/80 (see the sketch after this list)
Support the model(tokens) syntax by @TheBloke in https://github.com/PanQiWei/AutoGPTQ/pull/84
Support push_to_hub by @TheBloke in https://github.com/PanQiWei/AutoGPTQ/pull/91
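A hedged sketch combining two of these conveniences, device_map dispatch and the direct model(tokens) call (the repo id is illustrative and postdates this release; exact outputs depend on the model):

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    device_map="auto",  # dispatch layers across the available devices
)

# Assumes the input embeddings landed on cuda:0 under device_map="auto".
inp = tokenizer("Hello", return_tensors="pt").to("cuda:0")
out = model(**inp)  # transformers-style direct forward call
print(out.logits.shape)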
The following are new contributors and their first PRs. Thank you very much for your love of auto_gptq and your contributions! ❤️
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.1.0...v0.2.0
Published by PanQiWei over 1 year ago
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.0.5...v0.1.0
Published by PanQiWei over 1 year ago
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.0.4...v0.0.5
Published by PanQiWei over 1 year ago
triton is officially supported starting from this version!
pip install auto-gptq is supported starting from this version!
Full Changelog: https://github.com/PanQiWei/AutoGPTQ/compare/v0.0.3...v0.0.4
Published by PanQiWei over 1 year ago
Published by PanQiWei over 1 year ago
Add the eval_tasks module to support evaluating a model's performance on predefined downstream tasks before and after quantization.
Fix a bug in the LLaMa model related to position_ids.