uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and πŸ”œ video, up to 5x faster than OpenAI CLIP and LLaVA πŸ–ΌοΈ & πŸ–‹οΈ

License: Apache-2.0 · Downloads: 1.3K · Stars: 1K · Committers: 14

uform - v3.0.2 Latest Release

Published by ashvardanian 6 months ago

3.0.2 (2024-04-25)

Make

uform - v3.0.1

Published by ashvardanian 6 months ago

3.0.1 (2024-04-25)

Make

uform - UForm v3 for 3 platforms πŸ•ΈοΈπŸπŸ

Published by ashvardanian 6 months ago

Multimodal Embeddings for JavaScript, Swift, and Python

How many AI models can run on-device out of the box? UForm multimodal embeddings can πŸ₯³

| Model                                | Parameters | Languages | Architecture                                 |
| :----------------------------------- | ---------: | --------: | :------------------------------------------- |
| uform3-image-text-english-large πŸ†•    | 365M       | 1         | 6 text layers, ViT-L/14, 6 multimodal layers |
| uform3-image-text-english-base       | 143M       | 1         | 2 text layers, ViT-B/16, 2 multimodal layers |
| uform3-image-text-english-small πŸ†•    | 79M        | 1         | 2 text layers, ViT-S/16, 2 multimodal layers |
| uform3-image-text-multilingual-base  | 206M       | 21        | 8 text layers, ViT-B/16, 4 multimodal layers |

JavaScript

Load the models and preprocessors for different modalities:

import assert from 'node:assert';
import { getModel, Modality, TextProcessor, TextEncoder, ImageEncoder, ImageProcessor } from '@unum-cloud/uform';

const { configPath, modalityPaths, tokenizerPath } = await getModel({
    modelId: 'unum-cloud/uform3-image-text-english-small',
    modalities: [Modality.TextEncoder, Modality.ImageEncoder],
});

Embed images:

const imageProcessor = new ImageProcessor(configPath);
await imageProcessor.init();
const processedImages = await imageProcessor.process("path/to/image.png");

const imageEncoder = new ImageEncoder(modalityPaths.image_encoder, imageProcessor);
await imageEncoder.init();
const imageOutput = await imageEncoder.encode(processedImages);
assert(imageOutput.embeddings.dims.length === 2, "Output should be 2D");

Embed queries:

const textProcessor = new TextProcessor(configPath, tokenizerPath);
await textProcessor.init();
const processedTexts = await textProcessor.process("a small red panda in a zoo");

const textEncoder = new TextEncoder(modalityPaths.text_encoder, textProcessor);
await textEncoder.init();
const textOutput = await textEncoder.encode(processedTexts);
assert(textOutput.embeddings.dims.length === 2, "Output should be 2D");
await textEncoder.dispose();

Swift

Embed images:

let imageModel = try await ImageEncoder(modelName: "unum-cloud/uform3-image-text-english-small")
let imageURL = "https://github.com/ashvardanian/ashvardanian/blob/master/demos/bbq-on-beach.jpg?raw=true"
guard let url = URL(string: imageURL),
    let imageSource = CGImageSourceCreateWithURL(url as CFURL, nil),
    let cgImage = CGImageSourceCreateImageAtIndex(imageSource, 0, nil) else {
    fatalError("Could not load image from URL: \(imageURL)")
}

let imageEmbedding: Embedding = try imageModel.encode(cgImage)
let imageVector: [Float32] = imageEmbedding.asFloats()

Embed queries:

let textModel = try await TextEncoder(modelName: "unum-cloud/uform3-image-text-english-small")
let text = "A group of friends enjoy a barbecue on a sandy beach, with one person grilling over a large black grill, while the other sits nearby, laughing and enjoying the camaraderie."
let textEmbedding: Embedding = try textModel.encode(text)
let textVector: [Float32] = textEmbedding.asFloats()

Python

Load model:

from uform import get_model, Modality

model_name = 'unum-cloud/uform3-image-text-english-small'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)

Embed images:

import requests
from io import BytesIO
from PIL import Image

image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))

processor_image = processors[Modality.IMAGE_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
image_data = processor_image(image)
image_features, image_embedding = model_image.encode(image_data, return_features=True)

Embed queries:

text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'

model_text = models[Modality.TEXT_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]

text_data = processor_text(text)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
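
To put both modalities to use, the typical next step is a cosine similarity between the two embeddings. A minimal sketch, assuming the default PyTorch backend where encode returns 2D tensors of shape (1, embedding_dim); with the ONNX backend the outputs would be NumPy arrays instead:

import torch.nn.functional as F

# Cosine similarity in [-1, 1]; higher means the caption matches the image better.
# Assumes `image_embedding` and `text_embedding` are torch tensors from the code above.
similarity = F.cosine_similarity(image_embedding, text_embedding)
print(f"image-text similarity: {similarity.item():.3f}")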

Thanks to @xenova and @sroussey for help with JavaScript!
Thanks to @vmanot and @pcuenca for their work on Swift!

uform - v2.1.1

Published by ashvardanian 6 months ago

2.1.1 (2024-04-16)

Fix

  • Importing ViT in gen_model.py (#80) (21f49ba), closes #80
uform - v2.1.0

Published by ashvardanian 6 months ago

2.1.0 (2024-04-14)

Add

Fix

  • Image preprocessing in Swift (f2772d0)

Improve

  • Fetching nested configs (729b9d9)

Make

uform - v2.0.2

Published by ashvardanian 7 months ago

2.0.2 (2024-03-28)

Make

  • Fix PyPi CI version with hash (364afe6)
uform - v2.0.1

Published by ashvardanian 7 months ago

2.0.1 (2024-03-28)

Make

uform - Multimodal Matryoshka, Multimodal DPO, and ONNX πŸŽ‰

Published by ashvardanian 7 months ago

DPO Preview

Today we are releasing a new batch of multimodal models trained with Nebius and already available on HuggingFace πŸ€—

  1. Matryoshka-style multimodal embeddings that can be truncated to 64, 256, or the full 768 dimensions πŸ–ΌοΈ (see the sketch after this list)
  2. Improved multimodal chat in 1.2B parameters, tuned with Direct Preference Optimization πŸ’¬
  3. ONNX backend, making PyTorch dependency optional for lightning fast deployments ⚑
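
Matryoshka-style training makes the leading coordinates of the full 768-dimensional embedding usable on their own. A minimal sketch of how such truncation is typically applied, assuming a NumPy array of full-size embeddings; the slice sizes 64 and 256 come from the list above, and the re-normalization step is standard practice rather than a UForm-specific API:

import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

full = np.random.rand(8, 768).astype(np.float32)  # stand-in for real embeddings
tiny = truncate_matryoshka(full, 64)    # 12x smaller vectors for coarse retrieval
small = truncate_matryoshka(full, 256)  # 3x smaller, closer to full quality
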
uform - v1.1.1: Polishing the Repo

Published by ashvardanian 8 months ago

Great thanks to @lmmx, @blackforestboi, and @kapulkin for their patches to the project!


  • Performance observations for M2 CPUs (#56) (8374ef6), closes #56
  • Passing labels to text_decoder to compute loss. (#65) (f445a8b), closes #65
  • Larger batch benchmarks (fdc8587)
  • pre-commit config and linters (#62) (0a3efac), closes #62
uform - v1.1.0

Published by ashvardanian 8 months ago

1.1.0 (2024-02-15)

Add

uform - v1.0.3

Published by ashvardanian 10 months ago

1.0.3 (2023-12-29)

Improve

uform - v1.0.2

Published by ashvardanian 10 months ago

1.0.2 (2023-12-28)

Make

uform - UForm v1: Multimodal Chat in 1.5 Billion Parameters

Published by ashvardanian 10 months ago

UForm v1: Multimodal Chat in 1.5 Billion Parameters

The UForm family of tiny multimodal transformer models just got bigger! In addition to the existing CLIP-like embedding models, we now have a generative model useful for image captioning, visual question answering, and multimodal chats. All that in just 1.5 billion parameters, small enough to fit even on mobile devices πŸŽ‰

Repository: https://github.com/unum-cloud/uform
Generative model: https://huggingface.co/unum-cloud/uform-gen
Chat model: https://huggingface.co/unum-cloud/uform-gen-chat
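
A minimal captioning sketch in Python, based on the gen-model API advertised around this release; the class names (VLMForCausalLM, VLMProcessor in uform.gen_model), the "[cap]" prompt prefix, and the generation flags are assumptions that may differ in your installed version:

import torch
from PIL import Image
# Assumed import path and class names from the uform-gen model card of this era.
from uform.gen_model import VLMForCausalLM, VLMProcessor

model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")

prompt = "[cap] Summarize the visual content of the image."  # assumed captioning prefix
image = Image.open("photo.jpg")  # any local image

inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Strip the prompt tokens and decode only the newly generated caption.
caption = processor.batch_decode(output[:, inputs["input_ids"].shape[1]:])[0]
print(caption)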

Evaluation Metrics

Being the smallest model of its kind, unum-cloud/uform-gen is hard to compare to others. Next in size are the 5x larger LLaVAs and InstructBLIP, with 7 billion parameters. LLaVA performs noticeably better on VQAv2: 78.5 vs 66.5. On captioning, CLIPScore and RefCLIPScore are relatively close across all models.

| Model                             | Size | Caption Length | CLIPScore | RefCLIPScore |
| :-------------------------------- | ---: | :------------- | --------: | -----------: |
| llava-hf/llava-1.5-7b-hf          | 7B   | Long           | 0.878     | 0.529        |
| llava-hf/llava-1.5-7b-hf          | 7B   | Short          | 0.886     | 0.531        |
| Salesforce/instructblip-vicuna-7b | 7B   | Long           | 0.902     | 0.534        |
| Salesforce/instructblip-vicuna-7b | 7B   | Short          | 0.848     | 0.523        |
| unum-cloud/uform-gen              | 1.5B | Long           | 0.847     | 0.523        |
| unum-cloud/uform-gen              | 1.5B | Short          | 0.842     | 0.522        |
| unum-cloud/uform-gen-chat         | 1.5B | Long           | 0.860     | 0.525        |
| unum-cloud/uform-gen-chat         | 1.5B | Short          | 0.858     | 0.525        |

Throughput

On an RTX 3090, using vanilla PyTorch for inference with bfloat16 arithmetic and greedy decoding, you can expect roughly the following throughput; a measurement sketch follows the table.

| Model                             | Size | Speed               | Speedup |
| :-------------------------------- | ---: | :------------------ | ------: |
| llava-hf/llava-1.5-7b-hf          | 7B   | ~ 40 tokens/second  |         |
| Salesforce/instructblip-vicuna-7b | 7B   | ~ 40 tokens/second  |         |
| unum-cloud/uform-gen              | 1.5B | ~ 140 tokens/second | x 3.5   |
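
A rough way to reproduce a tokens-per-second figure like the ones above on your own hardware, using a plain Hugging Face text model as a stand-in; the model name and prompt are placeholders, while bfloat16 and greedy decoding mirror the setup described before the table:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; swap in whichever generative pipeline you are benchmarking.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer("Describe the image in detail.", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=256)  # greedy decoding
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/second")
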
uform - v0.4.8

Published by ashvardanian about 1 year ago

0.4.8 (2023-10-13)

Make

  • pass ANACONDA_API_TOKEN as env. var. (ed020d3)
uform - v0.4.7

Published by ashvardanian about 1 year ago

0.4.7 (2023-10-13)

Make

  • urllib3 after v2 breaks Anaconda pipeline (05ed238)
uform - v0.4.6

Published by ashvardanian about 1 year ago

0.4.6 (2023-10-13)

Make

uform - v0.4.5

Published by ashvardanian about 1 year ago

0.4.5 (2023-10-13)

Make

uform - v0.4.4

Published by ashvardanian about 1 year ago

0.4.4 (2023-09-20)

Docs

Improve

  • Expose TextEncoder and other model classes (47d969b)
uform - v0.4.3

Published by ashvardanian about 1 year ago

0.4.3 (2023-09-01)

Docs

Make

uform - v0.4.2

Published by ashvardanian about 1 year ago

0.4.2 (2023-08-17)

Docs

Fix

  • Latest Sphinx version not working (c3a0cc7)