🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
Apache-2.0 License
Published by echarlaix almost 2 years ago
- Add `ORTModel` inference to `ORTTrainer` and `ORTSeq2SeqTrainer` (#189)
- Add the possibility to pass `InferenceSession` options and provider to `ORTModel` (#271); see the sketch after this list
- Add `ORTOptimizer` support for `torch.fx` transformations (#348)
- `torch.fx` transformations now use the marking methods `mark_as_transformed`, `mark_as_restored` and `get_transformed_nodes` (#385)
- Fix `BaseConfig` for the `transformers` 4.22.0 release (#386)
- Fix `ORTTrainer` for the `transformers` 4.22.1 release (#388)
- Add `provider_options` to `ORTModel` (#401)
- `ORTModel` […], as `transformers` does for pipelines (#427)
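Below is a minimal sketch of passing `InferenceSession` options and an execution provider when loading an `ORTModel`; the keyword names follow the PRs above but should be treated as assumptions, and the checkpoint is only an example:

```python
import onnxruntime
from optimum.onnxruntime import ORTModelForSequenceClassification

# Configure the underlying ONNX Runtime InferenceSession
session_options = onnxruntime.SessionOptions()
session_options.intra_op_num_threads = 2

model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    from_transformers=True,
    provider="CPUExecutionProvider",  # execution provider (#271)
    session_options=session_options,  # InferenceSession options (#271)
    provider_options=None,            # per-provider settings (#401)
)
```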
Published by echarlaix about 2 years ago
- Refactoring of the `ORTQuantizer` (#270) and `ORTOptimizer` (#294)
- Add `ORTModelForCustomTasks`, allowing ONNX Runtime inference support for custom tasks (#303); see the sketch after this list
- Add `ORTModelForMultipleChoice`, allowing ONNX Runtime inference for models with a multiple choice classification head (#358)
- Add `FuseBiasInLinear`, a transformation that fuses the weight and the bias of linear modules (#253)
- Add support for `past_key_values` during ONNX Runtime inference of Seq2Seq models (#241)
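A minimal sketch of `ORTModelForCustomTasks`; the checkpoint (a sentence-transformers model exposing a pooler output) comes from the Optimum documentation, so treat it as an assumption:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCustomTasks

model_id = "optimum/sbert-all-MiniLM-L6-with-pooler"
model = ORTModelForCustomTasks.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("I love my new laptop!", return_tensors="pt")
outputs = model(**inputs)  # dict keyed by the model's custom ONNX output names
```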
Published by echarlaix over 2 years ago
The `optimum.fx.optimization` module (#232) provides a set of `torch.fx` graph transformations, along with classes and functions to write your own transformations and compose them.
- `Transformation` and `ReversibleTransformation` represent non-reversible and reversible transformations; it is possible to write such transformations by inheriting from those classes
- The `compose` utility function enables transformation composition (see the sketch after this list)
- `MergeLinears`: merges linear layers that have the same input
- `ChangeTrueDivToMulByInverse`: changes a division by a static value into a multiplication by its inverse
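As an illustration, here is a minimal sketch composing the two built-in transformations on a traced BERT model; the checkpoint and input names are assumptions:

```python
from transformers import BertModel
from transformers.utils.fx import symbolic_trace
from optimum.fx.optimization import ChangeTrueDivToMulByInverse, MergeLinears, compose

model = BertModel.from_pretrained("bert-base-uncased")
# torch.fx transformations operate on a traced GraphModule
traced = symbolic_trace(model, input_names=["input_ids", "attention_mask", "token_type_ids"])

# compose chains several transformations into a single one
transformation = compose(MergeLinears(), ChangeTrueDivToMulByInverse())
transformed_model = transformation(traced)
```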
`ORTModelForSeq2SeqLM` (#199) allows ONNX export and ONNX Runtime inference for Seq2Seq models.
Below is an example that downloads a T5 model from the Hugging Face Hub, exports it through the ONNX format and saves it:
```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Load model from hub and export it through the ONNX format
model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)

# Save the exported model in the given directory
output_dir = "t5_onnx"  # any local directory
model.save_pretrained(output_dir)
```
`ORTModelForImageClassification` (#226) allows ONNX Runtime inference for models with an image classification head.
Below is an example that downloads a ViT model from the Hugging Face Hub, exports it through the ONNX format and saves it:
```python
from optimum.onnxruntime import ORTModelForImageClassification

# Load model from hub and export it through the ONNX format
model = ORTModelForImageClassification.from_pretrained("google/vit-base-patch16-224", from_transformers=True)

# Save the exported model in the given directory
output_dir = "vit_onnx"  # any local directory
model.save_pretrained(output_dir)
```
Adds support for converting model weights from fp32 to fp16 by adding a new optimization parameter (`fp16`) to `OptimizationConfig` (#273).
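A minimal sketch, assuming `OptimizationConfig` is importable from `optimum.onnxruntime.configuration`:

```python
from optimum.onnxruntime.configuration import OptimizationConfig

# optimization_level=1 enables basic graph optimizations;
# fp16=True converts the model weights from fp32 to fp16
optimization_config = OptimizationConfig(optimization_level=1, fp16=True)
```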
Additional pipeline tasks are now supported, each with a default model.
Below is an example that downloads a T5 small model from the Hub and loads it with the transformers pipeline for translation:
```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small")
onnx_translation = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)

text = "What a beautiful day !"
pred = onnx_translation(text)
# [{'translation_text': "C'est une belle journée !"}]
```
The `ORTModelForXXX` execution provider now defaults to `CPUExecutionProvider` (#203). Previously, if no execution provider was specified, it was set to `CUDAExecutionProvider` when a GPU was detected, and to `CPUExecutionProvider` otherwise.
Published by echarlaix over 2 years ago
- `optimum-intel` (#212)
- `ORTModel` support for optimized and quantized models (#214)
Published by echarlaix over 2 years ago
- Extend `QuantizationPreprocessor` to dynamic quantization (https://github.com/huggingface/optimum/pull/196)
- `huggingface_hub` version and `protobuf` fix (https://github.com/huggingface/optimum/pull/205)
Published by echarlaix over 2 years ago
Add support for Python 3.7 (https://github.com/huggingface/optimum/pull/176)
Published by echarlaix over 2 years ago
`ORTModelForXXX` classes such as `ORTModelForSequenceClassification` were integrated with the Hugging Face Hub in order to easily export models through the ONNX format, load ONNX models, and easily save the resulting model or push it to the 🤗 Hub using the `save_pretrained` and `push_to_hub` methods, respectively. An already optimized and/or quantized ONNX model can also be loaded using the `ORTModelForXXX` classes with the `from_pretrained` method.
Below is an example that downloads a DistilBERT model from the Hub, exports it through the ONNX format and saves it:
```python
from optimum.onnxruntime import ORTModelForSequenceClassification

# Load model from hub and export it through the ONNX format
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    from_transformers=True,
)

# Save the exported model
model.save_pretrained("a_local_path_for_convert_onnx_model")
```
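The saved model can then be shared on the Hub with `push_to_hub`. A hypothetical sketch; the repository name and the exact argument names are assumptions:

```python
# Push the exported ONNX model to the Hugging Face Hub
# (assumed signature: local directory plus target repository id)
model.push_to_hub(
    "a_local_path_for_convert_onnx_model",
    repository_id="my-username/distilbert-onnx",
    use_auth_token=True,
)
```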
Built-in support for transformers pipelines was added. This allows you to leverage the same API used with Transformers, with the power of accelerated runtimes such as ONNX Runtime.
Several tasks are currently supported, each with a default model.
Below is an example that downloads a RoBERTa model from the Hub, exports it through the ONNX format and loads it with the `transformers` pipeline for `question-answering`.
```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

# load vanilla transformers model and convert it to ONNX
model = ORTModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2", from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

# test the model with the transformers pipeline, with handle_impossible_answer for squad_v2
optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = optimum_qa(
    question="What's my name?", context="My name is Philipp and I live in Nuremberg."
)

print(prediction)
# {'score': 0.9041663408279419, 'start': 11, 'end': 18, 'answer': 'Philipp'}
```
- `ORTTrainer` […], previously not enabled when inference was performed with ONNX Runtime (#152)
Published by JingyaHuang over 2 years ago
- […] `ORTModel`.
- Add the `IncludeFullyConnectedNodes` class to find the nodes composing the fully connected layers, in order to (only) target the latter for quantization and limit the accuracy drop.
- Update `QuantizationPreprocessor` so that the intersection of the two sets representing the nodes to quantize and the nodes to exclude from quantization is an empty set.
- Rename `Seq2SeqORTTrainer` to `ORTSeq2SeqTrainer` for clarity and to keep consistency.
- Add `ORTOptimizer` support for ELECTRA models.
- […] `ORTConfig`, which contains the optimization and quantization config.
Published by echarlaix over 2 years ago
The `ORTTrainer` and `Seq2SeqORTTrainer` are two new experimental classes.
- `ORTTrainer` and `Seq2SeqORTTrainer` were created to have a user-facing API similar to the `Trainer` and `Seq2SeqTrainer` of the Transformers library (see the sketch after this list).
- `ORTTrainer` allows using the ONNX Runtime backend to train a given PyTorch model in order to accelerate training. ONNX Runtime will run the forward and backward passes using an optimized, automatically-exported ONNX computation graph, while the rest of the training loop is executed by native PyTorch.
- `ORTTrainer` allows using ONNX Runtime inferencing during both the evaluation and the prediction steps.
- In the `Seq2SeqORTTrainer`, ONNX Runtime inferencing is incompatible with `--predict_with_generate`, as the generate method is not supported yet.
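A minimal sketch of the `ORTTrainer` API mirroring `transformers.Trainer`; the toy dataset and the `feature` argument are assumptions for illustration:

```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from optimum.onnxruntime import ORTTrainer

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Tiny toy dataset, only to keep the sketch self-contained
dataset = Dataset.from_dict({"text": ["great movie", "terrible movie"], "label": [1, 0]})
dataset = dataset.map(lambda e: tokenizer(e["text"], truncation=True, padding="max_length", max_length=32))

trainer = ORTTrainer(
    model=model,
    args=TrainingArguments(output_dir="ort_out", num_train_epochs=1),
    train_dataset=dataset,
    eval_dataset=dataset,
    feature="sequence-classification",  # ONNX export feature name (assumption for this version)
)
trainer.train()  # forward/backward run on the exported ONNX graph (requires onnxruntime-training)
```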
The `ORTQuantizer` and `ORTOptimizer` classes underwent a massive refactoring that should allow a simpler and more flexible user-facing API.
- The calibration dataset can now be fit in multiple passes with the `ORTQuantizer` method `partial_fit`. This is especially useful when using memory-hungry calibration methods such as the Entropy and Percentile methods.
- `OptimizationConfig`, `QuantizationConfig` and `CalibrationConfig` were added in order to better segment the different ONNX Runtime related parameters, instead of having one unique configuration, `ORTConfig` (see the sketch after this list).
- The `QuantizationPreprocessor` class was added in order to find the nodes to include and/or exclude from quantization, by finding the nodes following a given pattern (such as the nodes forming LayerNorm, for example). This is particularly useful in the context of static quantization, where the quantization of modules such as LayerNorm or GELU is responsible for an important drop in accuracy.
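A minimal sketch of the segmented configurations, assuming the `AutoQuantizationConfig` helper from `optimum.onnxruntime.configuration`:

```python
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

# Dynamic quantization targeting AVX512 instructions
qconfig = AutoQuantizationConfig.avx512(is_static=False, per_channel=False)

# Graph optimization settings now live in their own object
oconfig = OptimizationConfig(optimization_level=1)
```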
Published by echarlaix over 2 years ago
- The `ORTConfig` class was introduced, allowing the user to define the desired export, optimization and quantization strategies.
- The `ORTOptimizer` class takes care of the model's ONNX export as well as the graph optimization provided by ONNX Runtime. In order to create an instance of `ORTOptimizer`, the user needs to provide an `ORTConfig` object defining the export and graph-level transformation information. Optimization can then be performed by calling the `ORTOptimizer.fit` method (a sketch follows below).
- Quantization is handled by the `ORTQuantizer` class. In order to create an instance of `ORTQuantizer`, the user needs to provide an `ORTConfig` object defining the export and quantization information, such as the quantization approach to use or the activations and weights data types. Quantization can then be applied by calling the `ORTQuantizer.fit` method.
We have also added a new class called `IncOptimizer`, which will take care of combining the pruning and the quantization processes.
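A hypothetical sketch of the `ORTConfig`/`ORTOptimizer` flow described above; the class and method names appear in the notes, but the constructor fields and the `fit` signature shown here are assumptions:

```python
from optimum.onnxruntime import ORTConfig, ORTOptimizer

# Export and optimization strategy (field names are assumptions)
ort_config = ORTConfig(opset=12)

optimizer = ORTOptimizer(ort_config)
# Assumed signature: a model name or path plus an output directory
optimizer.fit("distilbert-base-uncased", output_dir="onnx_optimized")
```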
Published by echarlaix over 2 years ago
With this release, we enable Intel Neural Compressor v1.8 magnitude pruning for a variety of NLP tasks, with the introduction of `IncTrainer`, which handles the pruning process.
Published by echarlaix almost 3 years ago
With this release, we enable Intel Neural Compressor v1.7 PyTorch dynamic, post-training and aware-training quantization for a variety of NLP tasks. This support includes the overall process, from quantization application to the loading of the resulting quantized model. The latter being enabled by the introduction of the IncQuantizedModel
class.
Published by mfuntowicz about 3 years ago
Initial release for early access to the Optimum library, featuring Intel's LPOT quantization and pruning support.