🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
Published by echarlaix over 1 year ago
diffusers>=v0.18.0 by @echarlaix in https://github.com/huggingface/optimum/pull/1173
Full Changelog: https://github.com/huggingface/optimum/compare/v1.9.0...v1.9.1
Published by fxmarty over 1 year ago
Lower memory usage during the ONNX export. This is especially useful when exporting large models or exporting on a CUDA device. Until the PyTorch 2.1 release, we recommend using a PyTorch nightly build if memory issues are encountered, as two major bugs were fixed on the PyTorch side: https://github.com/pytorch/pytorch/pull/101134 https://github.com/pytorch/pytorch/pull/101148
The ONNX export now supports the sam, lilt, pix2struct, cvt and owlvit architectures.
The method main_export now supports two arguments, model_kwargs and custom_onnx_configs, that allow for a more customized export for advanced users. Reference.
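A minimal sketch of these two arguments (the model name, task, output path and the commented values below are illustrative assumptions, not part of the release note):

from optimum.exporters.onnx import main_export

main_export(
    "gpt2",                  # any supported model on the Hub (illustrative)
    output="gpt2_onnx/",     # where the ONNX files are written
    task="text-generation",
    # model_kwargs={"output_attentions": True},          # extra kwargs passed to the model during the export
    # custom_onnx_configs={"model": my_custom_config},   # hypothetical OnnxConfig overriding the default one
)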
IO Binding is useful not only to avoid copies between RAM and device memory, but also to avoid copies between numpy tensors and OrtValue. Thus, for autoregressive tasks, IO Binding is now enabled by default on CPUExecutionProvider as well, which may bring a >10% speedup for large context lengths.
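A minimal sketch of opting out of IO Binding, assuming the use_io_binding argument of from_pretrained shown later in these notes and an illustrative model name:

from optimum.onnxruntime import ORTModelForCausalLM

# IO Binding is now enabled by default for autoregressive tasks on CPUExecutionProvider
model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)

# pass use_io_binding=False to disable it, e.g. if memory issues are encountered
model_no_iob = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_io_binding=False)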
OptimizationConfig by @IlyasMoutawwakil in https://github.com/huggingface/optimum/pull/1036
attention_mask in ORTModelForxxx by @IlyasMoutawwakil in https://github.com/huggingface/optimum/pull/1045
input_points data type by @michaelbenayoun in https://github.com/huggingface/optimum/pull/1048
masked-im output name fix for transformers >= 4.29.0 by @michaelbenayoun in https://github.com/huggingface/optimum/pull/1049
ORTQuantizer.quantize call for static quantization when no calibration range is provided by @fxmarty in https://github.com/huggingface/optimum/pull/1094
Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.0...v1.9.0
Published by echarlaix over 1 year ago
transformers>=v4.30.0 by @echarlaix in https://github.com/huggingface/optimum/pull/1102
Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.7...v1.8.8
Published by echarlaix over 1 year ago
Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.6...v1.8.7
Published by regisss over 1 year ago
Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.5...v1.8.6
Published by regisss over 1 year ago
transformers<4.29.0 in Habana extra by @regisss in #1047
Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.4...v1.8.5
Published by echarlaix over 1 year ago
Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.3...v1.8.4
Published by echarlaix over 1 year ago
optimum-neuron extra by @michaelbenayoun in https://github.com/huggingface/optimum/pull/1021
Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.2...v1.8.3
Published by fxmarty over 1 year ago
Various improvements in the PyTorch BetterTransformer integration.
BetterTransformer support for ProphetNet by @hirotasoshu in https://github.com/huggingface/optimum/pull/923
[BT] Improve docs by @younesbelkada in https://github.com/huggingface/optimum/pull/944
Instead of using two separate decoder_model.onnx and decoder_with_past_model.onnx models, a single decoder can now be used for encoder-decoder models: decoder_model_merged.onnx. This avoids duplicating weights between the without-past and with-past ONNX models.
By default, if available, decoder_model_merged.onnx will be used in the ORTModel integration. This can be disabled with the option --no-post-process in the ONNX export CLI, and with use_merged=False in the ORTModel.from_pretrained method.
Example:
optimum-cli export onnx --model t5-small t5_onnx
will give:
└── t5_onnx
  ├── config.json
  ├── decoder_model_merged.onnx
  ├── decoder_model.onnx
  ├── decoder_with_past_model.onnx
  ├── encoder_model.onnx
  ├── generation_config.json
  ├── special_tokens_map.json
  ├── spiece.model
  ├── tokenizer_config.json
  └── tokenizer.json
And decoder_model_merged.onnx alone is enough for inference. We strongly recommend inspecting the subgraphs with netron to understand the inputs/outputs, in case the exported model is to be used with an engine other than ONNX Runtime through the Optimum integration.
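A minimal sketch of loading the exported directory above, assuming the use_merged argument described earlier and an illustrative T5 prompt:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# the merged decoder is used by default when available; use_merged=False falls back
# to the separate decoder_model.onnx / decoder_with_past_model.onnx files
tokenizer = AutoTokenizer.from_pretrained("t5_onnx")
model = ORTModelForSeq2SeqLM.from_pretrained("t5_onnx", use_merged=True)

inputs = tokenizer("translate English to German: Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))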
The TasksManager replaces legacy task names with the canonical ones used on the Hub and in transformers metadata:
sequence-classification becomes text-classification,
causal-lm becomes text-generation,
seq2seq-lm becomes text2text-generation,
speech2seq-lm and audio-ctc become automatic-speech-recognition,
default becomes feature-extraction,
masked-lm becomes fill-mask,
vision2seq-lm becomes image-to-text.
This should not break anything, unless you rely on private methods and attributes from TasksManager.
optimum-cli onnxruntime quantize / optimize output argument is now required by @michaelbenayoun in https://github.com/huggingface/optimum/pull/927
optimum-cli print the help of subcommands by @michaelbenayoun in https://github.com/huggingface/optimum/pull/940
Full Changelog: https://github.com/huggingface/optimum/compare/v1.7.3...v1.8.2
Published by fxmarty over 1 year ago
This patch release fixes a few bugs with the PyTorch 2.0 release, and includes a few new features as well.
We removed some constant past key value outputs from encoder-decoder models in the ONNX export. Beware that this could potentially break your existing code, but we recommend using the newly exported models, as this removes unnecessary Identity nodes in the models.
torch.nn.functional.scaled_dot_product_attention support for decoders in BetterTransformer
PyTorch 2.0 introduces, in beta, torch.nn.functional.scaled_dot_product_attention, a fastpath for attention extending its accelerated transformer features. This is included in optimum.bettertransformer to be used with the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.
Beware that this is still experimental and speedups have yet to be validated on all architectures.
PyTorch's scaled_dot_product_attention allows using flash attention and memory-efficient attention natively in PyTorch.
Usage is as follows:
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = BetterTransformer.transform(model) # modify transformers modeling to use native scaled_dot_product_attention
# do your inference or training here
model = BetterTransformer.reverse(model) # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")
Inference benchmark (on fp16):
Model | batch size | Input sequence length | Generated tokens | Latency eager (s) | Latency BT (s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
---|---|---|---|---|---|---|---|---|---|
gpt2 | 1 | 64 | 256 | 1.800 | 1.607 | 12.0% | 569.90 | 569.89 | 0% |
gpt2 | 64 | 64 | 256 | 2.159 | 1.617 | 33.5% | 2067.45 | 2093.80 | 0% |
opt-1.3b | 1 | 64 | 256 | 3.010 | 2.667 | 12.9% | 5408.238 | 5408.238 | 0% |
gpt-neox-20b | 1 | 64 | 256 | 10.869 | 9.937 | 9.4% | 83670.67 | 83673.53 | 0% |
Training benchmark (on fp16):
Model | batch size | Sequence length | time/epoch (eager, s) | time/epoch (BT, s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
---|---|---|---|---|---|---|---|---|
gpt2 | 8 | 1024 | 17.732 | 14.037 | 26.3% | 13291.16 | 10191.52 | 30.4% |
gpt2 | 32 | 1024 | 17.336 | 13.309 | 30.3% | 52834.83 | 38858.56 | 36.0% |
gpt2 | 64 | 1024 | OOM | 14.067 | / | OOM | 75600.08 | / |
Benchmarks can be reproduced using the inference script and training script:
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0
[BT] add decoder benchmark script by @younesbelkada in https://github.com/huggingface/optimum/pull/857
[BT] Fix bt benchmark by @younesbelkada in https://github.com/huggingface/optimum/pull/858
[BT] Add fp16 support by @younesbelkada in https://github.com/huggingface/optimum/pull/859
[BT] Add decoder training support by @younesbelkada in https://github.com/huggingface/optimum/pull/860
[BT] add accelerate_test markers by @younesbelkada in https://github.com/huggingface/optimum/pull/864
Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, OPT.
Continued progress in the TFLite export with quantization support. This is work in progress and not documented yet.
TasksManager by @michaelbenayoun in https://github.com/huggingface/optimum/pull/898
Full Changelog: https://github.com/huggingface/optimum/compare/v1.2.0...v1.7.2
Published by fxmarty over 1 year ago
Temporarily fix a critical bug in BetterTransformer https://github.com/huggingface/optimum/pull/849
Full Changelog: https://github.com/huggingface/optimum/compare/v1.7.0...v1.7.1
Published by fxmarty over 1 year ago
Additional architectures are supported in the ONNX export: PoolFormer, Pegasus, Audio Spectrogram Transformer, Hubert, SEW, Speech2Text, UniSpeech, UniSpeech-SAT, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Data2Vec Audio, MPNet, stable diffusion VAE encoder, vision encoder decoder, Nystromformer, Splinter, GPT NeoX.
optimum.exporters.onnx by @michaelbenayoun in https://github.com/huggingface/optimum/pull/622
A few additional architectures are supported in BetterTransformer: RoCBERT, RoFormer, Marian
New classes in the ONNX Runtime integration: ORTModelForMaskedLM, ORTModelForVision2Seq, ORTModelForAudioClassification, ORTModelForCTC, ORTModelForAudioXVector, ORTModelForAudioFrameClassification, ORTStableDiffusionPipeline.
Reference: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort and https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models#export-and-inference-of-stable-diffusion-models
In the ONNX export, it is possible to pass the options --fp16 --device cuda to export using float16 when a GPU is available, directly with the native torch.onnx.export.
Example: optimum-cli export onnx --model gpt2 --fp16 --device cuda gpt2_onnx/
torch.float16 type by @fxmarty in https://github.com/huggingface/optimum/pull/749
TFLite export is now supported, with static shapes:
optimum-cli export tflite --help
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/
exporters.tflite initial support by @michaelbenayoun in https://github.com/huggingface/optimum/pull/716
The ONNX export optionally supports applying ONNX Runtime optimizations directly during the export, by passing the --optimize O1 (up to --optimize O4) option:
optimum-cli export onnx --help
optimum-cli export onnx --model t5-small --optimize O3 t5small_onnx/
ONNX Runtime quantization is supported directly from the command line, using optimum-cli onnxruntime quantize:
optimum-cli onnxruntime quantize --help
optimum-cli onnxruntime quantize --onnx_model distilbert_onnx --avx512
ONNX Runtime optimization is supported directly from the command line, using optimum-cli onnxruntime optimize:
optimum-cli onnxruntime optimize --help
optimum-cli onnxruntime optimize --onnx_model distilbert_onnx -O3
Up to now, two ONNX files were used for decoders: one without the past key/values, and one with them.
This release introduces support, in the ONNX export and in ORTModelForCausalLM, for a single ONNX handling both steps of the decoding. This reduces memory usage, as weights are not duplicated between two separate models during inference.
A single ONNX for decoders can be used by passing use_merged=True to ORTModelForCausalLM.from_pretrained, loading directly from a PyTorch model:
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_merged=True)
Alternatively, using a single ONNX for decoders is the default behavior in the ONNX export; the exported model can later be used, for example, with ORTModelForCausalLM. The command optimum-cli export onnx --model gpt2 gpt2_onnx/ will produce:
└── gpt2_onnx
  ├── config.json
  ├── decoder_model_merged.onnx
  ├── decoder_model.onnx
  ├── decoder_with_past_model.onnx
  ├── merges.txt
  ├── special_tokens_map.json
  ├── tokenizer_config.json
  ├── tokenizer.json
  └── vocab.json
The decoder_model.onnx and decoder_with_past_model.onnx are kept separate for backward compatibility, but during inference, using solely decoder_model_merged.onnx is enough.
ORTModelForCausalLM by @JingyaHuang in https://github.com/huggingface/optimum/pull/647
ORTModel now accepts numpy arrays as inputs, in addition to PyTorch tensors. This is only the case for models that use a single ONNX.
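A minimal sketch of feeding numpy inputs to a single-ONNX model (the model name is an illustrative assumption; from_transformers is still the argument name in this release):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)

# return_tensors="np" produces numpy arrays, which ORTModel now accepts directly
inputs = tokenizer("I love this release!", return_tensors="np")
logits = model(**inputs).logits
print(logits.argmax(-1))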
--monolith.
--task causal-lm instead of --task causal-lm-with-past.
block_sparse attention type being written in pure numpy in Transformers, and hence not exportable to ONNX: https://github.com/huggingface/optimum/pull/778
from_transformers of ORTModel.from_pretrained will be deprecated in favor of export.
use_cache=True to ORTModel and no ONNX with cache is available by @fxmarty in https://github.com/huggingface/optimum/pull/650
from optimum.onnxruntime import QuantizationConfig by @fxmarty in https://github.com/huggingface/optimum/pull/715
ORTTrainer by @JingyaHuang in https://github.com/huggingface/optimum/pull/709
onnxruntime/modeling_ort.py refactor, part 1 by @michaelbenayoun in https://github.com/huggingface/optimum/pull/698
ORTTrainer inference with ONNX Runtime backend by @JingyaHuang in https://github.com/huggingface/optimum/pull/737
BetterTransformer.transform() by @fxmarty in https://github.com/huggingface/optimum/pull/750
exporters.onnx output names and dynamic axes fix by @michaelbenayoun in https://github.com/huggingface/optimum/pull/731
[BT] Add stable layer-norm Wav2vec2 by @younesbelkada in https://github.com/huggingface/optimum/pull/803
Full Changelog: https://github.com/huggingface/optimum/compare/v1.6.0...v1.7.0
Published by fxmarty over 1 year ago
Full Changelog: https://github.com/huggingface/optimum/compare/v1.6.3...v1.6.4
Published by JingyaHuang over 1 year ago
Fixes ORTTrainer for inference with the ONNX Runtime backend.
Published by fxmarty over 1 year ago
The export of the speech-to-text architecture as a single ONNX file (that handles both the encoding and decoding) fails due to a regression with the latest transformers version: https://github.com/huggingface/optimum/issues/721
Full Changelog: https://github.com/huggingface/optimum/compare/v1.6.1...v1.6.2
Published by fxmarty almost 2 years ago
Full Changelog: https://github.com/huggingface/optimum/compare/v1.6.0...v1.6.1
Published by fxmarty almost 2 years ago
The Optimum command line interface is introduced, and is now the official entrypoint for the ONNX export. Example commands:
optimum-cli --help
optimum-cli export onnx --help
optimum-cli export onnx --model bert-base-uncased --task sequence-classification bert_onnx/
Optimum now supports the ONNX export of stable diffusion models from the diffusers library:
optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 sd_v15_onnx/
BetterTransformer integration includes new models in this release: CLIP, RemBERT, mBART, ViLT, FSMT
The complete list of supported models is available in the documentation.
Bettertransformer support for FSMT by @Sumanth077 in https://github.com/huggingface/optimum/pull/494
BetterTransformer support for ViLT architecture by @ka00ri in https://github.com/huggingface/optimum/pull/508
MBart support for BetterTransformer by @ravenouse in https://github.com/huggingface/optimum/pull/516
The ONNX export now supports Swin, MobileNet-v1, MobileNet-v2.
[ONNX] add mobilenet support by @younesbelkada in https://github.com/huggingface/optimum/pull/633
Encoder-decoder or decoder-only models that normally make use of the generate() method in transformers can now be exported in several files using the --for-ort argument:
optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_small_onnx
yielding:
.
└── t5_small_onnx
  ├── config.json
  ├── decoder_model.onnx
  ├── decoder_with_past_model.onnx
  ├── encoder_model.onnx
  ├── special_tokens_map.json
  ├── spiece.model
  ├── tokenizer_config.json
  └── tokenizer.json
When passing --for-ort, the exported models are expected to be loadable directly into an ORTModel.
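For instance, a minimal sketch of loading the directory exported above:

from optimum.onnxruntime import ORTModelForSeq2SeqLM

# picks up encoder_model.onnx, decoder_model.onnx and decoder_with_past_model.onnx
model = ORTModelForSeq2SeqLM.from_pretrained("t5_small_onnx")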
--for-ort from optimum.exporters.onnx in ORTDecoder by @fxmarty in https://github.com/huggingface/optimum/pull/554
The ONNX export from PyTorch normally creates external data in case the exported model is larger than 2 GB. This release introduces better support for the export and use of large models, writing all external data into a .onnx_data file if necessary.
Various improvements to allow for a better user experience in the ONNX Runtime integration:
ORTModel, ORTModelDecoder and ORTModelForConditionalGeneration can now load any ONNX model files regardless of their names, allowing optimized and quantized models to be loaded without having to specify a file name argument.
ORTModel.from_pretrained() with from_transformers=True now downloads and loads the model in a temporary directory instead of the cache, which was not the right place to store it.
ORTQuantizer.save_pretrained() now saves the model configuration and the preprocessor, making the exported directory usable end-to-end.
ORTOptimizer.save_pretrained() now saves the preprocessor, making the exported directory usable end-to-end.
ONNX Runtime integration API improvement by @michaelbenayoun in https://github.com/huggingface/optimum/pull/515
The shape of the example input to provide for the export to ONNX can be overridden in case the validity of the ONNX model is sensitive to the shape used during the export.
Read more: optimum-cli export onnx --help
use_cache=True for ORTModelForCausalLM
Reusing past key values for models using ORTModelForCausalLM (e.g. gpt2) is now possible with use_cache=True, avoiding recomputing them at each iteration of the decoding:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = ORTModelForCausalLM.from_pretrained("gpt2", from_transformers=True, use_cache=True)
inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
gen_tokens = model.generate(**inputs)
tokenizer.batch_decode(gen_tokens)
ORTModelForCustomTasks now supports IO Binding when using CUDAExecutionProvider.
Along with --for-ort, passing --task causal-lm-with-past, --task seq2seq-lm-with-past or --task speech2seq-lm-with-past during the ONNX export exports two models: one not using the previously computed keys/values, and one using them.
Experimental support is introduced to merge the two models into one. Example:
optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_onnx/
import onnx
from optimum.onnx import merge_decoders
decoder = onnx.load("t5_onnx/decoder_model.onnx")
decoder_with_past = onnx.load("t5_onnx/decoder_with_past_model.onnx")
merged_model = merge_decoders(decoder, decoder_with_past)
onnx.save(merged_model, "t5_onnx/decoder_merged_model.onnx")
norm_first by @younesbelkada in https://github.com/huggingface/optimum/pull/510
encoder_last_hidden_state as an output for encoder-decoder models by @fxmarty in https://github.com/huggingface/optimum/pull/601
use_io_binding default value for different execution providers by @JingyaHuang in https://github.com/huggingface/optimum/pull/604
Full Changelog: https://github.com/huggingface/optimum/compare/v1.5.2...v1.6.0
The following contributors have made significant changes to the library over the last release:
Published by fxmarty almost 2 years ago
Constraint temporarily numpy<1.24.0 (#614)
Published by fxmarty almost 2 years ago
Deprecate PyTorch 1.12 for BetterTransformer with better error message (#513)
Published by michaelbenayoun almost 2 years ago
Convert your model into its PyTorch BetterTransformer format using a one-liner with the new BetterTransformer integration, for faster inference on CPU and GPU!
from optimum.bettertransformer import BetterTransformer
# model is any supported transformers model, e.g. loaded with AutoModel.from_pretrained(...)
model = BetterTransformer.transform(model)
Check the full list of supported models in the documentation, and check out the Google Colab demo.
BetterTransformer integration (#423)
ORT models (except for ORTModelForCustomTasks) now support IOBinding to avoid data copying overheads between the host and device, bringing a significant inference speedup during the decoding process on GPU.
By default, use_io_binding is set to True when using CUDA. You can turn off IOBinding in case of any memory issue:
from optimum.onnxruntime import ORTModelForSeq2SeqLM
model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small", use_io_binding=False)
optimum.exporters is a new module that handles the export of PyTorch and TensorFlow models to several backends. Only ONNX is supported for now, and more than 50 architectures can already be exported, among which BERT, GPT-Neo, Bloom, T5, ViT, Whisper, CLIP.
The export can be done via the CLI:
python -m optimum.exporters.onnx --model openai/whisper-tiny.en whisper_onnx/
For more information, check the documentation.
optimum.exporters creation (#403)
Whisper support in optimum.exporters and in optimum.onnxruntime, where IO binding is also supported.
Note: For now, the export from optimum.exporters will not be usable by ORTModelForSpeechSeq2Seq. To be able to run inference, export Whisper directly using ORTModelForSpeechSeq2Seq. This will be solved in the next release.
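A minimal sketch of that workaround (output path and model name are illustrative):

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

# export Whisper through ORTModelForSpeechSeq2Seq so the resulting files are usable for inference
model = ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny.en", from_transformers=True)
model.save_pretrained("whisper_onnx_ort/")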
optimum.onnxruntime and optimum.exporters (#420)
transformers 4.23.1 (#434)
ORTModel can load models from subfolders in a similar fashion as in transformers (#443)
ORTOptimizer has been refactored, and a factory class has been added to create common OptimizationConfigs (#457)