Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and video, up to 5x faster than OpenAI CLIP and LLaVA
Apache-2.0 License
Published by ashvardanian 6 months ago
How many AI models can run on-device out of the box? UForm multimodal embeddings can 🥳
| Model | Parameters | Languages | Architecture |
|---|---|---|---|
| `uform3-image-text-english-large` | 365M | 1 | 6 text layers, ViT-L/14, 6 multimodal layers |
| `uform3-image-text-english-base` | 143M | 1 | 2 text layers, ViT-B/16, 2 multimodal layers |
| `uform3-image-text-english-small` | 79M | 1 | 2 text layers, ViT-S/16, 2 multimodal layers |
| `uform3-image-text-multilingual-base` | 206M | 21 | 8 text layers, ViT-B/16, 4 multimodal layers |
Load the models and preprocessors for the different modalities in JavaScript:
```js
import { getModel, Modality, TextProcessor, TextEncoder, ImageEncoder, ImageProcessor } from '@unum-cloud/uform';

const { configPath, modalityPaths, tokenizerPath } = await getModel({
  modelId: 'unum-cloud/uform3-image-text-english-small',
  modalities: [Modality.TextEncoder, Modality.ImageEncoder],
});
```
Embed images:
```js
import assert from 'node:assert';

const imageProcessor = new ImageProcessor(configPath);
await imageProcessor.init();
const processedImages = await imageProcessor.process("path/to/image.png");

const imageEncoder = new ImageEncoder(modalityPaths.image_encoder, imageProcessor);
await imageEncoder.init();
const imageOutput = await imageEncoder.encode(processedImages);
assert(imageOutput.embeddings.dims.length === 2, "Output should be 2D");
```
Embed queries:
```js
import assert from 'node:assert';

const textProcessor = new TextProcessor(configPath, tokenizerPath);
await textProcessor.init();
const processedTexts = await textProcessor.process("a small red panda in a zoo");

const textEncoder = new TextEncoder(modalityPaths.text_encoder, textProcessor);
await textEncoder.init();
const textOutput = await textEncoder.encode(processedTexts);
assert(textOutput.embeddings.dims.length === 2, "Output should be 2D");
await textEncoder.dispose();
```
Embed images in Swift:
```swift
let imageModel = try await ImageEncoder(modelName: "unum-cloud/uform3-image-text-english-small")
let imageURL = "https://github.com/ashvardanian/ashvardanian/blob/master/demos/bbq-on-beach.jpg?raw=true"
guard let url = URL(string: imageURL),
    let imageSource = CGImageSourceCreateWithURL(url as CFURL, nil),
    let cgImage = CGImageSourceCreateImageAtIndex(imageSource, 0, nil)
else {
    throw Exception("Could not load image from URL: \(imageURL)")
}

var imageEmbedding: Embedding = try imageModel.encode(cgImage)
var imageVector: [Float32] = imageEmbedding.asFloats()
```
Embed queries:
```swift
let textModel = try await TextEncoder(modelName: "unum-cloud/uform3-image-text-english-small")
let text = "A group of friends enjoy a barbecue on a sandy beach, with one person grilling over a large black grill, while the other sits nearby, laughing and enjoying the camaraderie."
let textEmbedding: Embedding = try textModel.encode(text)
let textVector: [Float32] = textEmbedding.asFloats()
```
Load the model in Python:
```python
from uform import get_model, Modality

model_name = 'unum-cloud/uform3-image-text-english-small'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)
```
Embed images:
```python
import requests
from io import BytesIO
from PIL import Image

image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))

processor_image = processors[Modality.IMAGE_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]

image_data = processor_image(image)
image_features, image_embedding = model_image.encode(image_data, return_features=True)
```
Embed queries:
```python
text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'

model_text = models[Modality.TEXT_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]

text_data = processor_text(text)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
```
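With both vectors in hand, cross-modal retrieval reduces to ranking by cosine similarity between the text and image embeddings. A minimal sketch of that comparison (the `cosine_similarity` helper and the toy vectors are illustrative, not part of the `uform` API; in practice you would pass the flattened `text_embedding` and `image_embedding` from above):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors stand in for real embeddings here.
score = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])  # → 0.5
```

Higher scores mean the caption and the image are closer in the shared embedding space.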
Thanks to @xenova and @sroussey for help with JavaScript!
Thanks to @vmanot and @pcuenca for their work on Swift!
Published by ashvardanian 7 months ago
Today we are releasing a new batch of multimodal models trained with Nebius and already available on HuggingFace 🤗
Published by ashvardanian 8 months ago
Great thanks to @lmmx, @blackforestboi, and @kapulkin for their patches to the project!
Published by ashvardanian 10 months ago
The UForm family of tiny multimodal transformer models just got bigger! In addition to the existing CLIP-like embedding models, we now have a generative model useful for image captioning, visual question answering, and multimodal chats. All that is in just over a billion parameters, small enough to fit even on mobile devices.
Repository: https://github.com/unum-cloud/uform
Generative model: https://huggingface.co/unum-cloud/uform-gen
Chat model: https://huggingface.co/unum-cloud/uform-gen-chat
Being the smallest model of its kind, `unum-cloud/uform-gen` is hard to compare to others. The next models up in size, LLaVA and InstructBLIP, are roughly 5x larger at 7 billion parameters. LLaVA performs noticeably better on VQAv2: 78.5 vs 66.5. On captioning, CLIPScore and RefCLIPScore are relatively close across all models.
| Model | Size | Caption Length | CLIPScore | RefCLIPScore |
|---|---|---|---|---|
| `llava-hf/llava-1.5-7b-hf` | 7B | Long | 0.878 | 0.529 |
| `llava-hf/llava-1.5-7b-hf` | 7B | Short | 0.886 | 0.531 |
| `Salesforce/instructblip-vicuna-7b` | 7B | Long | 0.902 | 0.534 |
| `Salesforce/instructblip-vicuna-7b` | 7B | Short | 0.848 | 0.523 |
| `unum-cloud/uform-gen` | 1.5B | Long | 0.847 | 0.523 |
| `unum-cloud/uform-gen` | 1.5B | Short | 0.842 | 0.522 |
| `unum-cloud/uform-gen-chat` | 1.5B | Long | 0.860 | 0.525 |
| `unum-cloud/uform-gen-chat` | 1.5B | Short | 0.858 | 0.525 |
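For context, CLIPScore is a reference-free captioning metric built on the same embedding machinery: in the original formulation (Hessel et al., 2021) it is the clipped cosine similarity between image and caption embeddings, rescaled by a constant w = 2.5. A minimal sketch of that formula, with toy vectors standing in for real embeddings (note the table above may report it on a different scale):

```python
import math

def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, caption), 0)."""
    dot = sum(x * y for x, y in zip(image_emb, caption_emb))
    norm_i = math.sqrt(sum(x * x for x in image_emb))
    norm_c = math.sqrt(sum(y * y for y in caption_emb))
    return w * max(dot / (norm_i * norm_c), 0.0)

# Toy vectors with cosine similarity 0.5 → CLIPScore 2.5 * 0.5 = 1.25.
score = clip_score([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])
```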
On an RTX 3090, using vanilla PyTorch for inference with `bfloat16` arithmetic and greedy decoding, one should expect the following throughput numbers.
| Model | Size | Speed | Speedup |
|---|---|---|---|
| `llava-hf/llava-1.5-7b-hf` | 7B | ~40 tokens/second | |
| `Salesforce/instructblip-vicuna-7b` | 7B | ~40 tokens/second | |
| `unum-cloud/uform-gen` | 1.5B | ~140 tokens/second | x3.5 |
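The speedup column follows directly from the approximate throughput figures above, as a quick back-of-the-envelope check:

```python
# Approximate tokens/second on an RTX 3090, from the table above.
uform_gen_tps = 140   # unum-cloud/uform-gen
baseline_tps = 40     # llava-1.5-7b-hf and instructblip-vicuna-7b

speedup = uform_gen_tps / baseline_tps  # → 3.5
```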