transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

Apache-2.0 License

Stars
1.6K


transformer-deploy - Add GPT-2 acceleration support (latest release)

Published by pommedeterresautee over 2 years ago

  • add support for decoder-based models (GPT-2) on both ONNX Runtime and TensorRT
  • refactor Triton configuration generation (simplification)
  • add GPT-2 model documentation (notebook)
  • fix the CPU quantization benchmark (it was not using the quantized model)
  • fix a sentence-transformers bug
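Decoder models like GPT-2 generate text autoregressively, re-running the forward pass once per emitted token, which is why accelerating each pass with ONNX Runtime or TensorRT pays off. A minimal sketch of that loop, with a stand-in `next_logits` callback in place of a real exported model (all names here are illustrative, not transformer-deploy's API):

```python
def greedy_decode(next_logits, prompt, max_new_tokens, eos_id):
    """Autoregressive greedy decoding: one forward pass per generated token.

    next_logits: callable mapping the current token list to a list of
    vocabulary logits for the next position (in practice, an ONNX Runtime
    or TensorRT session wrapping the exported decoder).
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_logits(tokens)
        # Greedy strategy: pick the highest-scoring vocabulary id.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

Because latency grows with the number of generated tokens, even small per-pass speedups compound over a whole sequence.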
transformer-deploy - add CPU support and generic GPU quantization support

Published by pommedeterresautee almost 3 years ago

What's Changed

Full Changelog: https://github.com/ELS-RD/transformer-deploy/compare/v0.2.0...v0.3.0

transformer-deploy - add GPU quantization support

Published by pommedeterresautee almost 3 years ago

  • add support for INT8 GPU quantization
  • add an end-to-end quantization tutorial
  • add the QDQRoberta model
  • switch to ONNX opset 13
  • refactor TensorRT engine creation
  • fix bugs
  • add auth token support (for private Hugging Face repositories)
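INT8 quantization maps floating-point tensors onto 8-bit integers through a scale factor; the explicit quantize/dequantize (QDQ) nodes that models such as QDQRoberta carry make this mapping visible in the graph so TensorRT can fuse it into INT8 kernels. The underlying arithmetic, sketched in plain Python with a symmetric per-tensor scheme (function names are illustrative, not the library's API):

```python
def quantize_symmetric_int8(values):
    """Map floats to int8 codes using a symmetric per-tensor scale.

    The scale is chosen so the largest magnitude maps to 127 -- the same
    role calibration plays when quantizing real activations.
    """
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats from the int8 codes."""
    return [x * scale for x in q]
```

A quantize/dequantize round trip introduces at most `scale / 2` of rounding error per element, which is why calibrating the scale on representative data matters for accuracy.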

What's Changed

New Contributors

Full Changelog: https://github.com/ELS-RD/transformer-deploy/compare/v0.1.1...v0.2.0

transformer-deploy - update Triton image to 21.11-py3

Published by pommedeterresautee almost 3 years ago

  • update Docker image
  • update documentation
transformer-deploy - from PoC to library

Published by pommedeterresautee almost 3 years ago

  • switch from a proof of concept to a library
  • add support for the TensorRT Python API (for best performance)
  • improve documentation (move the Hugging Face Infinity comparison out of the main doc, add benchmarks, etc.)
  • fix issues with mixed precision
  • add a license
  • add tests, GitHub Actions, and a Makefile
  • change the way the Docker image is built
transformer-deploy - first release

Published by pommedeterresautee almost 3 years ago

All the scripts to reproduce the results of https://medium.com/p/e1be0057a51c