🤗 Hugging Face Inference Toolkit for Google Cloud Vertex AI

[!WARNING] This is still at a very early stage and subject to major changes.

Features

🤗 Straightforward way of deploying models from the Hugging Face Hub in Vertex AI
🐳 Automatically build Custom Prediction Routines (CPR) for Hugging Face Hub models using transformers.pipeline
📦 Everything is packaged within a single method, providing more flexibility and ease of use than the former google-cloud-aiplatform SDK for custom models
🔌 Seamless integration for running inference on top of any model from the Hugging Face Hub in Vertex AI thanks to transformers
🌅 Support for diffusers models too!
🔍 Includes custom logging messages for better monitoring and debugging via Google Cloud Logging

Get started

Install the gcloud CLI and authenticate with your Google Cloud account as follows:

gcloud init
gcloud auth login

Then install vertex-ai-huggingface-inference-toolkit via pip install:

pip install "vertex-ai-huggingface-inference-toolkit>=0.0.2"

Or via uv pip install for faster installations using uv:

uv pip install "vertex-ai-huggingface-inference-toolkit>=0.0.2"
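
After installing, one way to confirm that the installed version satisfies the >=0.0.2 constraint is to query the package metadata from Python. The following is a minimal sketch using only the standard library. The helper names (parse_version, meets_minimum, installed_version) are illustrative and not part of the toolkit, and the naive parser assumes purely numeric dotted versions such as 0.0.2.

```python
from importlib import metadata
from typing import Optional, Tuple

PACKAGE = "vertex-ai-huggingface-inference-toolkit"
MINIMUM = "0.0.2"


def parse_version(version: str) -> Tuple[int, ...]:
    """Naively parse a purely numeric dotted version such as '0.0.2'."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str = MINIMUM) -> bool:
    """Return True if `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package: str = PACKAGE) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print(f"{PACKAGE} is not installed")
    else:
        print(f"{PACKAGE} {version} (meets minimum: {meets_minimum(version)})")
```

Pre-release or local version strings (for example ones with a dev suffix) would need a real version parser rather than this simplified one.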

Example

from vertex_ai_huggingface_inference_toolkit import TransformersModel

model = TransformersModel(
    model_name_or_path="facebook/bart-large-mnli",
    framework="torch",
    framework_version="2.2.0",
    transformers_version="4.38.2",
    python_version="3.10",
    cuda_version="12.3.0",
    environment_variables={
        "HF_TASK": "zero-shot-classification",
    },
)
model.deploy(
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)

Once deployed, we can send requests to it via cURL:

curl -X POST -H "Content-Type: application/json" -d '{"sequences": "Messi is the GOAT", "candidate_labels": ["football", "basketball", "baseball"]}' <VERTEX_AI_ENDPOINT_URL>/predict
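
The same request can be prepared from Python. The sketch below only builds the POST request with the standard library, mirroring the cURL call above (same JSON body, same Content-Type header, same /predict path). The build_predict_request helper and the example.invalid URL are hypothetical placeholders, and any authentication headers a real endpoint may require are omitted, as in the cURL example.

```python
import json
import urllib.request

# Hypothetical placeholder; substitute the real <VERTEX_AI_ENDPOINT_URL>.
ENDPOINT_URL = "https://example.invalid"


def build_predict_request(endpoint_url, sequences, candidate_labels):
    """Build (but do not send) a POST request equivalent to the cURL call."""
    payload = {"sequences": sequences, "candidate_labels": candidate_labels}
    return urllib.request.Request(
        url=f"{endpoint_url.rstrip('/')}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    ENDPOINT_URL,
    sequences="Messi is the GOAT",
    candidate_labels=["football", "basketball", "baseball"],
)

# Sending is left commented out because it needs a live endpoint:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes it possible to inspect the payload before pointing it at a deployed endpoint.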

The same model can also be built against different framework, Python, and CUDA versions:

from vertex_ai_huggingface_inference_toolkit import TransformersModel

model = TransformersModel(
    model_name_or_path="facebook/bart-large-mnli",
    framework="torch",
    framework_version="2.1.0",
    python_version="3.9",
    cuda_version="11.8.0",
    environment_variables={
        "HF_TASK": "zero-shot-classification",
    },
)

An existing Docker image can also be used by passing image_uri. For this approach to work consistently, the image should have been built with vertex_ai_huggingface_inference_toolkit in advance.

from vertex_ai_huggingface_inference_toolkit import TransformersModel

model = TransformersModel(
    model_name_or_path="facebook/bart-large-mnli",
    image_uri="us-east1-docker.pkg.dev/huggingface-cloud/vertex-ai-huggingface-inference-toolkit/py3.11-cu12.3.0-torch-2.2.0-transformers-4.38.2:latest",
    environment_variables={
        "HF_TASK": "zero-shot-classification",
    },
)

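A text-generation example running TinyLlama in float16 with Flash Attention 2, with flash-attn installed as an extra requirement:

from vertex_ai_huggingface_inference_toolkit import TransformersModel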

model = TransformersModel(
    project_id="my-project",
    location="us-east1",
    model_name_or_path="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    model_kwargs={"torch_dtype": "float16", "attn_implementation": "flash_attention_2"},
    extra_requirements=["flash-attn --no-build-isolation"],
    environment_variables={
        "HF_TASK": "text-generation",
    },
)

References / Acknowledgements

This work is heavily inspired by the early work on sagemaker-huggingface-inference-toolkit by Philipp Schmid, Hugging Face, and Amazon Web Services.
