
Quick-Deploy

quick-deploy provides tools to optimize, convert, and deploy machine learning models as fast inference APIs (low latency and high throughput) on Triton Inference Server using the ONNX Runtime backend. It supports 🤗 transformers, PyTorch, TensorFlow, scikit-learn, and XGBoost models.

Get Started

Let's walk through a quick example by deploying BERT for GPU inference. quick-deploy already supports 🤗 transformers, so we can pass either the path of a pretrained model or just its name from the Hub:

$ quick-deploy transformers \
    -n my-bert-base \
    -p text-classification \
    -m bert-base-uncased \
    -o ./models \
    --model-type bert \
    --seq-len 128 \
    --cuda
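This writes a Triton model repository under ./models. The exact contents depend on the model and the quick-deploy version, but for an ONNX Runtime deployment it typically looks roughly like the following (illustrative layout):

models/
└── my-bert-base/
    ├── config.pbtxt        (Triton model configuration)
    └── 1/                  (model version directory)
        └── model.onnx      (optimized ONNX graph)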

The command above creates the deployment artifacts by optimizing the model and converting it to ONNX. Next, just run the inference server:

$ docker run -it --rm \
    --gpus all \
    --shm-size 256m \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    -v $(pwd)/models:/models nvcr.io/nvidia/tritonserver:21.11-py3 \
    tritonserver --model-repository=/models
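Before sending requests, you can optionally check that the server came up and loaded the model. A minimal sketch using tritonclient's HTTP health APIs (the URL assumes the port mapping above):

import tritonclient.http

# connect to Triton's HTTP endpoint exposed by the container above
client = tritonclient.http.InferenceServerClient(url="127.0.0.1:8000")

assert client.is_server_live(), "Triton is not live"
assert client.is_server_ready(), "Triton is not ready"

# list the models Triton discovered in the repository
print(client.get_model_repository_index())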

Now we can use tritonclient, which talks to the server over HTTP, to consume our model:

import numpy as np
import tritonclient.http
from transformers import BertTokenizer, TensorType


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model_name = "my_bert_base"
url = "127.0.0.1:8000"
model_version = "1"
batch_size = 1

text = "The goal of life is [MASK]."
tokens = tokenizer(text=text, return_tensors=TensorType.NUMPY)

triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)
assert triton_client.is_model_ready(
    model_name=model_name, model_version=model_version
), f"model {model_name} not yet ready"

# declare the request inputs/outputs; names and shapes must match the exported model
input_ids = tritonclient.http.InferInput(name="input_ids", shape=(batch_size, 9), datatype="INT64")
token_type_ids = tritonclient.http.InferInput(name="token_type_ids", shape=(batch_size, 9), datatype="INT64")
attention = tritonclient.http.InferInput(name="attention_mask", shape=(batch_size, 9), datatype="INT64")
model_output = tritonclient.http.InferRequestedOutput(name="output", binary_data=False)

# copy the tokenizer outputs into the request inputs (batch_size is 1 here)
input_ids.set_data_from_numpy(tokens['input_ids'])
token_type_ids.set_data_from_numpy(tokens['token_type_ids'])
attention.set_data_from_numpy(tokens['attention_mask'])

response = triton_client.infer(
    model_name=model_name,
    model_version=model_version,
    inputs=[input_ids, token_type_ids, attention],
    outputs=[model_output],
)

token_logits = response.as_numpy("output")
print(token_logits)

Note: This deploys only the model; tokenization and post-processing should be done on the client side. Full transformers pipeline deployment is coming soon.
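As an illustration of that client-side post-processing, here is a minimal sketch that decodes the [MASK] prediction from the response above, assuming the exported model returns per-token vocabulary logits of shape (batch_size, sequence_length, vocab_size):

import numpy as np
from scipy.special import softmax

# position of the [MASK] token in the tokenized input
mask_index = int(np.where(tokens["input_ids"][0] == tokenizer.mask_token_id)[0][0])

# probabilities over the vocabulary for that position
mask_probs = softmax(token_logits[0, mask_index])

# top-5 candidate tokens for the masked position
top_ids = np.argsort(mask_probs)[::-1][:5]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))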

For more use cases please check the examples page.

Install

Before installing, make sure to install only the extras for your target model type, e.g. "torch", "sklearn", or "all". There are two options to use quick-deploy: via the Docker container:

$ docker run --rm -it rodrigobaron/quick-deploy:0.1.1-all --help

or install the quick-deploy Python library:

$ pip install quick-deploy[all]

Note: This installs the full version with all extras.
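To keep the install lighter, you can pull in a single extra instead, for example the "torch" extra for PyTorch models only:

$ pip install quick-deploy[torch]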

Contributing

Please follow the Contributing guide.

License

Apache License 2.0
