optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools

Apache-2.0 License · Downloads: 946K · Stars: 2.1K · Committers: 84


optimum - v1.9.1: Patch release

Published by echarlaix over 1 year ago

Full Changelog: https://github.com/huggingface/optimum/compare/v1.9.0...v1.9.1

optimum - v1.9: extended ONNX, ONNX Runtime support

Published by fxmarty over 1 year ago

Improved memory management in the ONNX export

Memory usage during the ONNX export has been lowered. This is especially useful when exporting large models or exporting on a CUDA device. Until the PyTorch 2.1 release, we recommend using a PyTorch nightly build if memory issues are encountered, as two major bugs were fixed on the PyTorch side: https://github.com/pytorch/pytorch/pull/101134 https://github.com/pytorch/pytorch/pull/101148

Extended ONNX export

The ONNX export now supports the sam, lilt, pix2struct, cvt and owlvit architectures.

Support of custom ONNX configurations for export

The main_export method now supports two arguments, model_kwargs and custom_onnx_configs, which allow advanced users to customize the export. See the reference documentation for details.
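
A hedged sketch of the mechanism (the model id, config class and commented-out kwargs below are illustrative assumptions, not an official example):

from transformers import AutoConfig
from optimum.exporters.onnx import main_export
from optimum.exporters.onnx.model_configs import BertOnnxConfig

model_id = "bert-base-uncased"
config = AutoConfig.from_pretrained(model_id)

# Key each exported subgraph by name ("model" for a single-graph export) and provide
# the OnnxConfig instance to use for it instead of the default one.
custom_onnx_configs = {"model": BertOnnxConfig(config, task="text-classification")}

main_export(
    model_id,
    output="bert_onnx/",
    task="text-classification",
    custom_onnx_configs=custom_onnx_configs,
    # model_kwargs={"output_attentions": True},  # extra kwargs can be forwarded to the model's forward
)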

Extended BetterTransformer support

ONNX Runtime: use IO Binding by default for decoder models on CPUExecutionProvider

IO Binding is useful not only to avoid copies between RAM and device memory, but also to avoid copies between numpy tensors and OrtValue. For autoregressive tasks, IO Binding is therefore now enabled by default on CPUExecutionProvider as well, which may bring a >10% speedup for large context lengths.
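
In practice (illustrative model id), IO Binding can still be disabled explicitly if needed:

from optimum.onnxruntime import ORTModelForCausalLM

# On CPUExecutionProvider, IO Binding is now enabled by default for decoder models.
model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)

# It can still be turned off explicitly.
model_no_io_binding = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_io_binding=False)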

ORTModelForSpeechSeq2Seq supported in ORTOptimizer
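
A hedged sketch of running the optimizer on a speech seq2seq model (the model id and optimization level are illustrative):

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Export the model to ONNX, then optimize its subgraphs with ONNX Runtime.
model = ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny.en", export=True)
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(optimization_config=optimization_config, save_dir="whisper_tiny_optimized")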

Major bugfixes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.0...v1.9.0

optimum - v1.8.8: Patch release

Published by echarlaix over 1 year ago

Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.7...v1.8.8

optimum - v1.8.7: Patch release

Published by echarlaix over 1 year ago

optimum - v1.8.6: Patch release

Published by regisss over 1 year ago

  • Fix CLI for exporting models to TFLite by @regisss #1059

Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.5...v1.8.6

optimum - v1.8.5: Patch release

Published by regisss over 1 year ago

  • Add transformers<4.29.0 in Habana extra by @regisss in #1047

Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.4...v1.8.5

optimum - v1.8.4: Patch release

Published by echarlaix over 1 year ago

optimum - v1.8.3: Patch release

Published by echarlaix over 1 year ago

Full Changelog: https://github.com/huggingface/optimum/compare/v1.8.2...v1.8.3

optimum - v1.8: extended BetterTransformer support, ONNX merged seq2seq models

Published by fxmarty over 1 year ago

Extended BetterTransformer support

Various improvements in the PyTorch BetterTransformer integration.

ONNX merged seq2seq models

Instead of using two separate decoder_model.onnx and decoder_with_past_model.onnx files, a single decoder can be used for encoder-decoder models: decoder_model_merged.onnx. This avoids duplicating weights between the without-past and with-past ONNX models.

By default, decoder_model_merged.onnx will be used in the ORTModel integration when it is available. This can be disabled with the --no-post-process option in the ONNX export CLI, and with use_merged=False in the ORTModel.from_pretrained method.

Example:

optimum-cli export onnx --model t5-small t5_onnx

will give:

└── t5_onnx
    ├── config.json
    ├── decoder_model_merged.onnx
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── encoder_model.onnx
    ├── generation_config.json
    ├── special_tokens_map.json
    ├── spiece.model
    ├── tokenizer_config.json
    └── tokenizer.json

The decoder_model_merged.onnx file alone is enough for inference. If the exported model is to be used with an engine other than ONNX Runtime through the Optimum integration, we strongly recommend inspecting the subgraphs with netron to understand the inputs and outputs.
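
As an illustrative sketch, assuming the t5_onnx directory exported above, the merged decoder is picked up automatically, and use_merged=False falls back to the separate decoders:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5_onnx")

# Uses decoder_model_merged.onnx by default since it is present in the directory.
model = ORTModelForSeq2SeqLM.from_pretrained("t5_onnx")

# Fall back to the separate decoder_model.onnx / decoder_with_past_model.onnx files.
model_unmerged = ORTModelForSeq2SeqLM.from_pretrained("t5_onnx", use_merged=False)

inputs = tokenizer("translate English to German: hello", return_tensors="pt")
print(tokenizer.batch_decode(model.generate(**inputs)))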

New models in the ONNX export

Major bugfix

Potentially breaking changes

The TasksManager replaces legacy tasks names by the canonical ones used on the Hub and in transformers metadata:

  • sequence-classification becomes text-classification,
  • causal-lm becomes text-generation,
  • seq2seq-lm becomes text2text-generation,
  • speech2seq-lm and audio-ctc become automatic-speech-recognition,
  • default becomes feature-extraction,
  • masked-lm becomes fill-mask,
  • vision2seq-lm becomes image-to-text

This should not break anything, unless you rely on private methods and attributes of TasksManager.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/optimum/compare/v1.7.3...v1.8.2

optimum - v1.7.3: Patch release for PyTorch 2.0 and transformers 4.27.0

Published by fxmarty over 1 year ago

This patch release fixes a few bugs with the PyTorch 2.0 release, and includes a few new features as well.

Breaking change: constant outputs removed from ONNX encoder-decoder models

We removed some constant past key value outputs from encoder-decoder models in the ONNX export. Beware that this could potentially break your existing code, but we recommend using the newly exported models, as this removes unnecessary Identity nodes from the models.

torch.nn.functional.scaled_dot_product_attention support for decoders in BetterTransformer

PyTorch 2.0 introduces torch.nn.functional.scaled_dot_product_attention in beta, a fastpath for attention that extends its accelerated transformer features. It is included in optimum.bettertransformer and can be used with the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.

Beware that this is still experimental and speedups have yet to be validated on all architectures.

PyTorch's scaled_dot_product_attention makes it possible to use flash attention and memory-efficient attention natively in PyTorch.

Usage is as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

model = BetterTransformer.transform(model)  # modify transformers modeling to use native scaled_dot_product_attention

# do your inference or training here

model = BetterTransformer.reverse(model)  # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")

Inference benchmark (on fp16):

| Model | batch size | Input sequence length | Generated tokens | Latency eager (s) | Latency BT (s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
|---|---|---|---|---|---|---|---|---|---|
| gpt2 | 1 | 64 | 256 | 1.800 | 1.607 | 12.0% | 569.90 | 569.89 | 0% |
| gpt2 | 64 | 64 | 256 | 2.159 | 1.617 | 33.5% | 2067.45 | 2093.80 | 0% |
| opt-1.3b | 1 | 64 | 256 | 3.010 | 2.667 | 12.9% | 5408.238 | 5408.238 | 0% |
| gpt-neox-20b | 1 | 64 | 256 | 10.869 | 9.937 | 9.4% | 83670.67 | 83673.53 | 0% |

Training benchmark (on fp16):

| Model | batch size | Sequence length | time/epoch (eager, s) | time/epoch (BT, s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
|---|---|---|---|---|---|---|---|---|
| gpt2 | 8 | 1024 | 17.732 | 14.037 | 26.3% | 13291.16 | 10191.52 | 30.4% |
| gpt2 | 32 | 1024 | 17.336 | 13.309 | 30.3% | 52834.83 | 38858.56 | 36.0% |
| gpt2 | 64 | 1024 | OOM | 14.067 | / | OOM | 75600.08 | / |

Benchmarks can be reproduced using the inference script and training script:

python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0

New architectures in the ONNX export

Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, OPT.

(WIP) TFLite export with quantization support

Continued progress in the TFLite export with quantization support. This is work in progress and not documented yet.

Bugfixes and improvements

New Contributors

Full Changelog: https://github.com/huggingface/optimum/compare/v1.2.0...v1.7.2

optimum - v1.7.1: Patch release

Published by fxmarty over 1 year ago

New models supported in the ONNX export

Additional architectures are supported in the ONNX export: PoolFormer, Pegasus, Audio Spectrogram Transformer, Hubert, SEW, Speech2Text, UniSpeech, UniSpeech-SAT, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Data2Vec Audio, MPNet, stable diffusion VAE encoder, vision encoder decoder, Nystromformer, Splinter, GPT NeoX.

New models supported in BetterTransformer

A few additional architectures are supported in BetterTransformer: RoCBERT, RoFormer, Marian

Additional tasks supported in the ONNX Runtime integration

With ORTModelForMaskedLM, ORTModelForVision2Seq, ORTModelForAudioClassification, ORTModelForCTC, ORTModelForAudioXVector, ORTModelForAudioFrameClassification, ORTStableDiffusionPipeline.

Reference: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort and https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models#export-and-inference-of-stable-diffusion-models

Support of the ONNX export from PyTorch on float16

In the ONNX export, it is possible to pass the options --fp16 --device cuda to export in float16 when a GPU is available, relying directly on the native torch.onnx.export.

Example: optimum-cli export onnx --model gpt2 --fp16 --device cuda gpt2_onnx/

TFLite export

TFLite export is now supported, with static shapes:

optimum-cli export tflite --help
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/

ONNX Runtime optimization and quantization directly in the CLI

The ONNX export can optionally apply ONNX Runtime optimizations directly during the export, by passing an option from --optimize O1 up to --optimize O4:

optimum-cli export onnx --help
optimum-cli export onnx --model t5-small --optimize O3 t5small_onnx/

ONNX Runtime quantization is supported directly in command line, using optimum-cli onnxruntime quantize:

optimum-cli onnxruntime quantize --help
optimum-cli onnxruntime quantize --onnx_model distilbert_onnx --avx512

ONNX Runtime optimization is supported directly in command line, using optimum-cli onnxruntime optimize:

optimum-cli onnxruntime optimize --help
optimum-cli onnxruntime optimize --onnx_model distilbert_onnx -O3

ORTModelForCausalLM supports decoding with a single ONNX

Up to now, for decoders, two ONNX files were used:

  • One handling the first forward pass where no past key values have been cached yet - thus not taking them as input.
  • One handling the following forward pass where past key values have been cached, thus taking them as input.

This release introduces support in the ONNX export and in ORTModelForCausalLM for a single ONNX file handling both steps of the decoding. This reduces memory usage, as weights are not duplicated between two separate models during inference.

A single ONNX file for decoders can be used by passing use_merged=True to ORTModelForCausalLM.from_pretrained, loading directly from a PyTorch model:

from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_merged=True)

Alternatively, exporting a single ONNX file for decoders is the default behavior of the ONNX export, and the result can later be used, for example, with ORTModelForCausalLM. The command optimum-cli export onnx --model gpt2 gpt2_onnx/ will produce:

└── gpt2_onnx
    ├── config.json
    ├── decoder_model_merged.onnx
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── merges.txt
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── vocab.json

decoder_model.onnx and decoder_with_past_model.onnx are kept separate for backward compatibility, but decoder_model_merged.onnx alone is enough for inference.

Single-file ORTModel accepts numpy arrays

ORTModel now accepts numpy arrays as inputs, in addition to PyTorch tensors. This is only the case for models that use a single ONNX file.
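
A minimal sketch, with an illustrative model id:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

# numpy inputs are accepted directly, in addition to PyTorch tensors.
inputs = tokenizer("Optimum makes ONNX Runtime easy to use.", return_tensors="np")
outputs = model(**inputs)
print(outputs.logits)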

ORTOptimizer support for ORTModelForCausalLM

Breaking changes

  • In the ONNX export, exporting models as several ONNX files (encoder, decoder) is now the default behavior: https://github.com/huggingface/optimum/pull/747. The old behavior is still accessible with --monolith.
  • In decoders, reusing past key values is now the default in the ONNX export: https://github.com/huggingface/optimum/pull/748. The old behavior is still accessible by explicitly passing, for example, --task causal-lm instead of --task causal-lm-with-past.
  • BigBird support in the ONNX export is removed, due to the block_sparse attention type being written in pure numpy in Transformers, and hence not exportable to ONNX: https://github.com/huggingface/optimum/pull/778
  • The parameter from_transformers of ORTModel.from_pretrained will be deprecated in favor of export.

Bugfixes and improvements

Full Changelog: https://github.com/huggingface/optimum/compare/v1.6.0...v1.7.0

optimum - v1.6.4: Patch release

Published by fxmarty over 1 year ago

Bugfix

Full Changelog: https://github.com/huggingface/optimum/compare/v1.6.3...v1.6.4

optimum - v1.6.3: Patch release

Published by JingyaHuang over 1 year ago

Fixes ORTTrainer for inference with the ONNX Runtime backend.

optimum - v1.6.2: Patch release

Published by fxmarty over 1 year ago

Hotfixes

Regressions

The export of the speech-to-text architecture as a single ONNX file (handling both the encoding and decoding) fails due to a regression with the latest transformers version: https://github.com/huggingface/optimum/issues/721

Full Changelog: https://github.com/huggingface/optimum/compare/v1.6.1...v1.6.2

optimum - v1.6.1: Patch release

Published by fxmarty almost 2 years ago

Hotfixes

Full Changelog: https://github.com/huggingface/optimum/compare/v1.6.0...v1.6.1

Optimum CLI

The Optimum command line interface is introduced, and is now the official entrypoint for the ONNX export. Example commands:

optimum-cli --help
optimum-cli export onnx --help
optimum-cli export onnx --model bert-base-uncased --task sequence-classification bert_onnx/

Stable Diffusion ONNX export

Optimum now supports the ONNX export of stable diffusion models from the diffusers library:

optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 sd_v15_onnx/

BetterTransformer support for more architectures

BetterTransformer integration includes new models in this release: CLIP, RemBERT, mBART, ViLT, FSMT

The complete list of supported models is available in the documentation.

ONNX export for more architectures

The ONNX export now supports Swin, MobileNet-v1, MobileNet-v2.

Extended ONNX export for encoder-decoder and decoder models

Encoder-decoder and decoder-only models that normally make use of the generate() method in transformers can now be exported as several ONNX files using the --for-ort argument:

optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_small_onnx

yielding:

.
└── t5_small_onnx
    ├── config.json
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── encoder_model.onnx
    ├── special_tokens_map.json
    ├── spiece.model
    ├── tokenizer_config.json
    └── tokenizer.json

When passing --for-ort, the exported models are expected to be directly loadable into ORTModel.
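
For instance, a minimal sketch loading the directory exported above into the corresponding ORTModel class (assuming the seq2seq task from the command shown earlier):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5_small_onnx")
model = ORTModelForSeq2SeqLM.from_pretrained("t5_small_onnx")

inputs = tokenizer("translate English to French: Hello, world!", return_tensors="pt")
print(tokenizer.batch_decode(model.generate(**inputs)))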

Support for ONNX models with external data at export, optimization, quantization

The ONNX export from PyTorch normally creates external data files in case the exported model is larger than 2 GB. This release introduces better support for the export and use of large models, writing all external data into a .onnx_data file if necessary.

ONNX Runtime API improvement

Various improvements to allow for a better user experience in the ONNX Runtime integration:

  • ORTModel, ORTModelDecoder and ORTModelForConditionalGeneration can now load any ONNX model file regardless of its name, making it possible to load optimized and quantized models without having to specify a file name argument.

  • ORTModel.from_pretrained() with from_transformers=True now downloads and loads the model in a temporary directory instead of the cache, which was not the right place to store it.

  • ORTQuantizer.save_pretrained() now saves the model configuration and the preprocessor, making the exported directory usable end-to-end.

  • ORTOptimizer.save_pretrained() now saves the preprocessor, making the exported directory usable end-to-end.

  • ONNX Runtime integration API improvement by @michaelbenayoun in https://github.com/huggingface/optimum/pull/515

Custom shapes support at ONNX export

The shapes of the example inputs provided for the ONNX export can be overridden in case the validity of the ONNX model is sensitive to the shapes used during the export.

Read more: optimum-cli export onnx --help

Enable use_cache=True for ORTModelForCausalLM

Reusing past key values for models using ORTModelForCausalLM (e.g. gpt2) is now possible with use_cache=True, avoiding recomputing them at each iteration of the decoding:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = ORTModelForCausalLM.from_pretrained("gpt2", from_transformers=True, use_cache=True)

inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")

gen_tokens = model.generate(**inputs)
tokenizer.batch_decode(gen_tokens)

IO binding support for ORTModelForCustomTasks

ORTModelForCustomTasks now supports IO Binding when using CUDAExecutionProvider.
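
A hedged sketch (the model id is illustrative; provider and use_io_binding are existing from_pretrained arguments):

from optimum.onnxruntime import ORTModelForCustomTasks

# With CUDAExecutionProvider, IO Binding avoids host/device copies during inference.
model = ORTModelForCustomTasks.from_pretrained(
    "optimum/sbert-all-MiniLM-L6-with-pooler",
    provider="CUDAExecutionProvider",
    use_io_binding=True,
)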

Experimental support to merge ONNX decoder with/without past key values

Along with --for-ort, passing --task causal-lm-with-past, --task seq2seq-lm-with-past or --task speech2seq-lm-with-past during the ONNX export produces two models: one that does not use the previously computed keys/values, and one that uses them.

Experimental support is introduced to merge the two models into one. Example:

optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_onnx/

import onnx
from optimum.onnx import merge_decoders

decoder = onnx.load("t5_onnx/decoder_model.onnx")
decoder_with_past = onnx.load("t5_onnx/decoder_with_past_model.onnx")

merged_model = merge_decoders(decoder, decoder_with_past)
onnx.save(merged_model, "t5_onnx/decoder_merged_model.onnx")

Major bugs fixed

Other changes, bugfixes and improvements

Full Changelog: https://github.com/huggingface/optimum/compare/v1.5.2...v1.6.0

Significant community contributions

The following contributors have made significant changes to the library over the last release:

optimum - v1.5.2: Patch release

Published by fxmarty almost 2 years ago

Temporarily constrain numpy<1.24.0 (#614)

optimum - v1.5.1: Patch release

Published by fxmarty almost 2 years ago

Deprecate PyTorch 1.12 for BetterTransformer, with a better error message (#513)

BetterTransformer

Convert your model into its PyTorch BetterTransformer format with a one-liner, using the new BetterTransformer integration for faster inference on CPU and GPU!

from optimum.bettertransformer import BetterTransformer

model = BetterTransformer.transform(model)

Check the full list of supported models in the documentation, and check out the Google Colab demo.

Contributions

  • BetterTransformer integration (#423)
  • ViT and Wav2Vec2 support (#470)

ONNX Runtime IOBinding support

ORT models (except for ORTModelForCustomTasks) now support IOBinding to avoid data copying overheads between the host and device, bringing a significant inference speedup during the decoding process on GPU.

By default, use_io_binding is set to True when using CUDA. You can turn off the IOBinding in case of any memory issue:

from optimum.onnxruntime import ORTModelForSeq2SeqLM

model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small", use_io_binding=False)

Contributions

  • Add IOBinding support to ONNX Runtime module (#421)

Optimum Exporters

optimum.exporters is a new module that handles the export of PyTorch and TensorFlow models to several backends. Only ONNX is supported for now, and more than 50 architectures can already be exported, among them BERT, GPT-Neo, Bloom, T5, ViT, Whisper and CLIP.

The export can be done via the CLI:

python -m optimum.exporters.onnx --model openai/whisper-tiny.en whisper_onnx/

For more information, check the documentation.

Contributions

  • optimum.exporters creation (#403)
  • Automatic task detection (#445)

Whisper

  • Whisper can be exported to ONNX using optimum.exporters.
  • Whisper can also be exported and run using optimum.onnxruntime; IO Binding is also supported.

Note: For now, the export from optimum.exporters will not be usable by ORTModelForSpeechSeq2Seq. To be able to run inference, export Whisper directly using ORTModelForSpeechSeq2Seq. This will be solved in the next release.
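
A minimal sketch of the recommended path for this release (from_transformers=True was the export flag at the time):

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

# Export Whisper to ONNX directly through ORTModelForSpeechSeq2Seq and save it for reuse.
model = ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny.en", from_transformers=True)
model.save_pretrained("whisper_onnx/")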

Contributions

  • Whisper support with optimum.onnxruntime and optimum.exporters (#420)

Other contributions

  • ONNX Runtime training now supports ORT 1.13.1 and transformers 4.23.1 (#434)
  • ORTModel can load models from subfolders in a similar fashion to transformers (#443)
  • ORTOptimizer has been refactored, and a factory class has been added to create common OptimizationConfigs (#457)
  • Fixes and updates in the documentation (#411, #432, #437, #441)
  • Fixes IOBinding (#454, #461)