transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!


transformers.js - 2.17.2 Latest Release

Published by xenova 5 months ago

🚀 What's new?

🤗 New contributors

Full Changelog: https://github.com/xenova/transformers.js/compare/2.17.1...2.17.2

transformers.js - 2.17.1

Published by xenova 6 months ago

What's new?

Full Changelog: https://github.com/xenova/transformers.js/compare/2.17.0...2.17.1

transformers.js - 2.17.0

Published by xenova 6 months ago

What's new?

💬 Improved text-generation pipeline for conversational models

This version adds support for passing an array of chat messages (with "role" and "content" properties) to the text-generation pipeline (PR). Check out the list of supported models here.

Example: Chat with Xenova/Qwen1.5-0.5B-Chat.

import { pipeline } from '@xenova/transformers';

// Create text-generation pipeline
const generator = await pipeline('text-generation', 'Xenova/Qwen1.5-0.5B-Chat');

// Define the list of messages
const messages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Tell me a funny joke.' }
]

// Generate text
const output = await generator(messages, {
    max_new_tokens: 128,
    do_sample: false,
})
console.log(output[0].generated_text);
// [
//   { role: 'system', content: 'You are a helpful assistant.' },
//   { role: 'user', content: 'Tell me a funny joke.' },
//   { role: 'assistant', content: "Sure, here's one:\n\nWhy was the math book sad?\n\nBecause it had too many problems.\n\nI hope you found that joke amusing! Do you have any other questions or topics you'd like to discuss?" },
// ]

We also added the return_full_text parameter: if you set return_full_text=false, only the newly-generated tokens will be returned (this only applies when passing a raw text prompt to the pipeline).
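
For example, a minimal sketch reusing the same Qwen chat model as above, this time with a raw text prompt (the prompt and output shown here are purely illustrative):

import { pipeline } from '@xenova/transformers';

// Create text-generation pipeline
const generator = await pipeline('text-generation', 'Xenova/Qwen1.5-0.5B-Chat');

// Pass a raw text prompt; with return_full_text=false, the prompt is not echoed back
const output = await generator('The capital of France is', {
    max_new_tokens: 8,
    return_full_text: false,
});
console.log(output[0].generated_text);
// e.g. ' Paris.' (only the newly-generated text, without the original prompt)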

🔢 Binary embedding quantization support

Transformers.js v2.17 adds two new parameters to the feature-extraction pipeline ("quantize" and "precision"), enabling you to generate binary embeddings. These can be used with certain embedding models to shrink the size of document embeddings for retrieval, reducing index size and memory usage while improving retrieval speed. Surprisingly, you can still achieve up to ~95% of the original performance, but with 32x storage savings and up to a 32x speedup in retrieval! 🤯 Thanks to @jonathanpv for this addition in https://github.com/xenova/transformers.js/pull/691!

import { pipeline } from '@xenova/transformers';

// Create feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Compute binary embeddings
const output = await extractor('This is a simple test.', { pooling: 'mean', quantize: true, precision: 'binary' });
// Tensor {
//   type: 'int8',
//   data: Int8Array [49, 108, 24, ...],
//   dims: [1, 48]
// }

As you can see, this produces a 32x smaller output tensor (a 4x reduction in data type with Float32Array → Int8Array, as well as an 8x reduction in dimensionality from 384 → 48). For more information, check out this PR in sentence-transformers, which inspired this update!
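
At retrieval time, packed binary embeddings like these are typically compared with the Hamming distance. A minimal sketch (the hammingDistance helper below is illustrative and not part of the library):

// Hamming distance between two packed binary embeddings (e.g. Int8Array)
function hammingDistance(a, b) {
    let distance = 0;
    for (let i = 0; i < a.length; ++i) {
        let xor = (a[i] ^ b[i]) & 0xff; // mask to one byte (Int8Array values may be negative)
        while (xor) { // popcount: count the differing bits
            distance += xor & 1;
            xor >>= 1;
        }
    }
    return distance;
}

// Compare two binary embeddings produced by the pipeline above (lower = more similar)
const emb_a = (await extractor('This is a simple test.', { pooling: 'mean', quantize: true, precision: 'binary' })).data;
const emb_b = (await extractor('This is another test.', { pooling: 'mean', quantize: true, precision: 'binary' })).data;
console.log(hammingDistance(emb_a, emb_b));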

🛠️ Misc. improvements

🤗 New contributors

Full Changelog: https://github.com/xenova/transformers.js/compare/2.16.1...2.17.0

transformers.js - 2.16.1

Published by xenova 7 months ago

What's new?

  • Add support for the image-feature-extraction pipeline in https://github.com/xenova/transformers.js/pull/650.

    Example: Perform image feature extraction with Xenova/vit-base-patch16-224-in21k.

    import { pipeline } from '@xenova/transformers';

    // Create image feature extraction pipeline
    const image_feature_extractor = await pipeline('image-feature-extraction', 'Xenova/vit-base-patch16-224-in21k');
    const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png';
    const features = await image_feature_extractor(url);
    // Tensor {
    //   dims: [ 1, 197, 768 ],
    //   type: 'float32',
    //   data: Float32Array(151296) [ ... ],
    //   size: 151296
    // }
    

    Example: Compute image embeddings with Xenova/clip-vit-base-patch32.

    import { pipeline } from '@xenova/transformers';

    // Create image feature extraction pipeline
    const image_feature_extractor = await pipeline('image-feature-extraction', 'Xenova/clip-vit-base-patch32');
    const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png';
    const features = await image_feature_extractor(url);
    // Tensor {
    //   dims: [ 1, 512 ],
    //   type: 'float32',
    //   data: Float32Array(512) [ ... ],
    //   size: 512
    // }
    
  • Fix channel format when padding non-square images for certain models in https://github.com/xenova/transformers.js/pull/655. This means you can now perform super-resolution for non-square images with APISR models:

    Example: Upscale an image with Xenova/4x_APISR_GRL_GAN_generator-onnx.

    import { pipeline } from '@xenova/transformers';
    
    // Create image-to-image pipeline
    const upscaler = await pipeline('image-to-image', 'Xenova/4x_APISR_GRL_GAN_generator-onnx', {
        quantized: false,
    });
    
    // Upscale an image
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/anime.png';
    const output = await upscaler(url);
    // RawImage {
    //   data: Uint8Array(16588800) [ ... ],
    //   width: 2560,
    //   height: 1920,
    //   channels: 3
    // }
    
    // (Optional) Save the upscaled image
    output.save('upscaled.png');
    

    Input image:
    image

    Output image:
    image

  • Update tokenizer apply_chat_template functionality in https://github.com/xenova/transformers.js/pull/647. This PR adds support for the new C4AI Command-R tokenizer, including its tool-use and RAG chat templates.

    import { AutoTokenizer } from "@xenova/transformers";
    
    const tokenizer = await AutoTokenizer.from_pretrained("Xenova/c4ai-command-r-v01-tokenizer")
    
    // define conversation input:
    const conversation = [
      { role: "user", content: "Whats the biggest penguin in the world?" }
    ]
    // Define tools available for the model to use:
    const tools = [
      {
        name: "internet_search",
        description: "Returns a list of relevant document snippets for a textual query retrieved from the internet",
        parameter_definitions: {
          query: {
            description: "Query to search the internet with",
            type: "str",
            required: true
          }
        }
      },
      {
        name: "directly_answer",
        description: "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
        parameter_definitions: {}
      }
    ]
    
    
    // render the tool use prompt as a string:
    const tool_use_prompt = tokenizer.apply_chat_template(
      conversation,
      {
        chat_template: "tool_use",
        tokenize: false,
        add_generation_prompt: true,
        tools,
      }
    )
    console.log(tool_use_prompt)
    
    The same tokenizer can also render the grounded generation (RAG) prompt:

    import { AutoTokenizer } from "@xenova/transformers";
    
    const tokenizer = await AutoTokenizer.from_pretrained("Xenova/c4ai-command-r-v01-tokenizer")
    
    // define conversation input:
    const conversation = [
      { role: "user", content: "Whats the biggest penguin in the world?" }
    ]
    // define documents to ground on:
    const documents = [
      { title: "Tall penguins", text: "Emperor penguins are the tallest growing up to 122 cm in height." },
      { title: "Penguin habitats", text: "Emperor penguins only live in Antarctica." }
    ]
    
    // render the RAG prompt as a string:
    const grounded_generation_prompt = tokenizer.apply_chat_template(
      conversation,
      {
        chat_template: "rag",
        tokenize: false,
        add_generation_prompt: true,
    
        documents,
        citation_mode: "accurate", // or "fast"
      }
    )
    console.log(grounded_generation_prompt);
    
  • Add support for EfficientNet in https://github.com/xenova/transformers.js/pull/639.

    Example: Classify images with chriamue/bird-species-classifier

    import { pipeline } from '@xenova/transformers';
    
    // Create image classification pipeline
    const classifier = await pipeline('image-classification', 'chriamue/bird-species-classifier', {
        quantized: false,      // Quantized model doesn't work
        revision: 'refs/pr/1', // Needed until the model author merges the PR
    });
    
    // Classify an image
    const url = 'https://upload.wikimedia.org/wikipedia/commons/7/73/Short_tailed_Albatross1.jpg';
    const output = await classifier(url);
    console.log(output)
    // [{ label: 'ALBATROSS', score: 0.9999023079872131 }]
    

Full Changelog: https://github.com/xenova/transformers.js/compare/2.16.0...2.16.1

transformers.js - 2.16.0

Published by xenova 8 months ago

What's new?

💬 StableLM text-generation models

This version adds support for the StableLM family of text-generation models (up to 1.6B params), developed by Stability AI. Huge thanks to @D4ve-R for this contribution in https://github.com/xenova/transformers.js/pull/616! See here for the full list of supported models.

Example: Text generation with Xenova/stablelm-2-zephyr-1_6b.

import { pipeline } from '@xenova/transformers';

// Create text generation pipeline
const generator = await pipeline('text-generation', 'Xenova/stablelm-2-zephyr-1_6b');

// Define the prompt and list of messages
const prompt = "Tell me a funny joke."
const messages = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": prompt },
]

// Apply chat template
const inputs = generator.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true,
});

// Generate text
const output = await generator(inputs, { max_new_tokens: 20 });
console.log(output[0].generated_text);
// "<|system|>\nYou are a helpful assistant.\n<|user|>\nTell me a funny joke.\n<|assistant|>\nHere's a joke for you:\n\nWhy don't scientists trust atoms?\n\nBecause they make up everything!"

Note: these models may be too large to run in your browser at the moment, so for now, we recommend using them in Node.js. Stay tuned for updates on this!

🔉 Speaker verification and diarization models

Example: Speaker verification w/ Xenova/wavlm-base-plus-sv.

import { AutoProcessor, AutoModel, read_audio, cos_sim } from '@xenova/transformers';

// Load processor and model
const processor = await AutoProcessor.from_pretrained('Xenova/wavlm-base-plus-sv');
const model = await AutoModel.from_pretrained('Xenova/wavlm-base-plus-sv');

// Helper function to compute speaker embedding from audio URL
async function compute_embedding(url) {
    const audio = await read_audio(url, 16000);
    const inputs = await processor(audio);
    const { embeddings } = await model(inputs);
    return embeddings.data;
}

// Generate speaker embeddings
const BASE_URL = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/sv_speaker';
const speaker_1_1 = await compute_embedding(`${BASE_URL}-1_1.wav`);
const speaker_1_2 = await compute_embedding(`${BASE_URL}-1_2.wav`);
const speaker_2_1 = await compute_embedding(`${BASE_URL}-2_1.wav`);
const speaker_2_2 = await compute_embedding(`${BASE_URL}-2_2.wav`);

// Compute similarity scores
console.log(cos_sim(speaker_1_1, speaker_1_2)); // 0.959439158881247 (Both are speaker 1)
console.log(cos_sim(speaker_1_2, speaker_2_1)); // 0.618130172602329 (Different speakers)
console.log(cos_sim(speaker_2_1, speaker_2_2)); // 0.962999814169370 (Both are speaker 2)

Example: Perform speaker diarization with Xenova/wavlm-base-plus-sd.

import { AutoProcessor, AutoModelForAudioFrameClassification, read_audio } from '@xenova/transformers';

// Read and preprocess audio
const processor = await AutoProcessor.from_pretrained('Xenova/wavlm-base-plus-sd');
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const audio = await read_audio(url, 16000);
const inputs = await processor(audio);

// Run model with inputs
const model = await AutoModelForAudioFrameClassification.from_pretrained('Xenova/wavlm-base-plus-sd');
const { logits } = await model(inputs);
// {
//   logits: Tensor {
//     dims: [ 1, 549, 2 ],  // [batch_size, num_frames, num_speakers]
//     type: 'float32',
//     data: Float32Array(1098) [-3.5301010608673096, ...],
//     size: 1098
//   }
// }

const labels = logits[0].sigmoid().tolist().map(
    frames => frames.map(speaker => speaker > 0.5 ? 1 : 0)
);
console.log(labels); // labels is a binary array of shape (num_frames, num_speakers)
// [
//     [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0],
//     [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0],
//     [0, 0], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1],
//     ...
// ]

These additions were made possible thanks to the following PRs:

📝 Improved chat templating operation coverage

With this release, we're pleased to announce that Transformers.js is now able to parse every single valid chat template that is currently on the Hugging Face Hub! 🤯 As of 2024/03/05, this is around ~12k conversational models (of which there were ~250 unique templates). Of course, future models may introduce more complex chat templates, and we'll continue to add support for them!

For example, Transformers.js can now generate the prompt for highly complex function-calling models (e.g., fireworks-ai/firefunction-v1):

import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('fireworks-ai/firefunction-v1')

const function_spec = [
    {
        name: 'get_stock_price',
        description: 'Get the current stock price',
        parameters: {
            type: 'object',
            properties: {
                symbol: {
                    type: 'string',
                    description: 'The stock symbol, e.g. AAPL, GOOG'
                }
            },
            required: ['symbol']
        }
    },
    {
        name: 'check_word_anagram',
        description: 'Check if two words are anagrams of each other',
        parameters: {
            type: 'object',
            properties: {
                word1: {
                    type: 'string',
                    description: 'The first word'
                },
                word2: {
                    type: 'string',
                    description: 'The second word'
                }
            },
            required: ['word1', 'word2']
        }
    }
]

const messages = [
    { role: 'functions', content: JSON.stringify(function_spec, null, 4) },
    { role: 'system', content: 'You are a helpful assistant with access to functions. Use them if required.' },
    { role: 'user', content: 'Hi, can you tell me the current stock price of AAPL?' }
]

const inputs = tokenizer.apply_chat_template(messages, { tokenize: false });
console.log(inputs);
// <s>SYSTEM: You are a helpful assistant ...

🎨 New example applications and demos

🛠️ Misc. improvements

🤗 New contributors

Full Changelog: https://github.com/xenova/transformers.js/compare/2.15.1...2.16.0

transformers.js - 2.15.1

Published by xenova 8 months ago

What's new?

Full Changelog: https://github.com/xenova/transformers.js/compare/2.15.0...2.15.1

transformers.js - 2.15.0

Published by xenova 9 months ago

What's new?

🤖 Qwen1.5 Chat models (0.5B and 1.8B)

Yesterday, the Qwen team (Alibaba Group) released the Qwen1.5 series of chat models. As part of the release, they published several sub-2B-parameter models, including Qwen/Qwen1.5-0.5B-Chat and Qwen/Qwen1.5-1.8B-Chat, which both demonstrate strong performance despite their small sizes. The best part? They can run in the browser with Transformers.js (PR)! 🚀 See here for the full list of supported models.

demo-2x

Example: Text generation with Xenova/Qwen1.5-0.5B-Chat.

import { pipeline } from '@xenova/transformers';

// Create text-generation pipeline
const generator = await pipeline('text-generation', 'Xenova/Qwen1.5-0.5B-Chat');

// Define the prompt and list of messages
const prompt = "Give me a short introduction to large language model."
const messages = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": prompt }
]

// Apply chat template
const text = generator.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true,
});

// Generate text
const output = await generator(text, {
    max_new_tokens: 128,
    do_sample: false,
});
console.log(output[0].generated_text);
// 'A large language model is a type of artificial intelligence system that can generate text based on the input provided by users, such as books, articles, or websites. It uses advanced algorithms and techniques to learn from vast amounts of data and improve its performance over time through machine learning and natural language processing (NLP). Large language models have become increasingly popular in recent years due to their ability to handle complex tasks such as generating human-like text quickly and accurately. They have also been used in various fields such as customer service chatbots, virtual assistants, and search engines for information retrieval purposes.'

🧍 MODNet for Portrait Image Matting

Next, we added support for MODNet, a small (but powerful) portrait image matting model (PR). Thanks to @cyio for the suggestion!

animation

Example: Perform portrait image matting with Xenova/modnet.

import { AutoModel, AutoProcessor, RawImage } from '@xenova/transformers';

// Load model and processor
const model = await AutoModel.from_pretrained('Xenova/modnet', { quantized: false });
const processor = await AutoProcessor.from_pretrained('Xenova/modnet');

// Load image from URL
const url = 'https://images.pexels.com/photos/5965592/pexels-photo-5965592.jpeg?auto=compress&cs=tinysrgb&w=1024';
const image = await RawImage.fromURL(url);

// Pre-process image
const { pixel_values } = await processor(image);

// Predict alpha matte
const { output } = await model({ input: pixel_values });

// Save output mask
const mask = await RawImage.fromTensor(output[0].mul(255).to('uint8')).resize(image.width, image.height);
mask.save('mask.png');
Input image | Output mask

🧠 New text embedding models

We also added support for several new text embedding models, including:

Check out the links for example usage.
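
All of these can be used with the feature-extraction pipeline. As a generic sketch (using Xenova/all-MiniLM-L6-v2, shown earlier in these notes, as a stand-in for whichever embedding model you choose):

import { pipeline } from '@xenova/transformers';

// Create feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Compute sentence embeddings with mean pooling and L2 normalization
const sentences = ['That is a happy person', 'That is a very happy person'];
const embeddings = await extractor(sentences, { pooling: 'mean', normalize: true });
console.log(embeddings.dims); // [2, 384]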

🛠️ Other improvements

Full Changelog: https://github.com/xenova/transformers.js/compare/2.14.2...2.15.0

transformers.js - 2.14.2

Published by xenova 9 months ago

What's new?

Full Changelog: https://github.com/xenova/transformers.js/compare/2.14.1...2.14.2

transformers.js - 2.14.1

Published by xenova 9 months ago

What's new?

Full Changelog: https://github.com/xenova/transformers.js/compare/2.14.0...2.14.1

transformers.js - 2.14.0

Published by xenova 9 months ago

What's new?

🚀 Segment Anything Model (SAM)

The Segment Anything Model (SAM) can be used to generate segmentation masks for objects in a scene, given an input image and input points. See here for the full list of pre-converted models. Support for this model was added in https://github.com/xenova/transformers.js/pull/510.

demo

Demo + source code: https://huggingface.co/spaces/Xenova/segment-anything-web

Example: Perform mask generation w/ Xenova/slimsam-77-uniform.

import { SamModel, AutoProcessor, RawImage } from '@xenova/transformers';

const model = await SamModel.from_pretrained('Xenova/slimsam-77-uniform');
const processor = await AutoProcessor.from_pretrained('Xenova/slimsam-77-uniform');

const img_url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/corgi.jpg';
const raw_image = await RawImage.read(img_url);
const input_points = [[[340, 250]]] // 2D localization of a window

const inputs = await processor(raw_image, input_points);
const outputs = await model(inputs);

const masks = await processor.post_process_masks(outputs.pred_masks, inputs.original_sizes, inputs.reshaped_input_sizes);
console.log(masks);
// [
//   Tensor {
//     dims: [ 1, 3, 410, 614 ],
//     type: 'bool',
//     data: Uint8Array(755220) [ ... ],
//     size: 755220
//   }
// ]
const scores = outputs.iou_scores;
console.log(scores);
// Tensor {
//   dims: [ 1, 1, 3 ],
//   type: 'float32',
//   data: Float32Array(3) [
//     0.8350210189819336,
//     0.9786665439605713,
//     0.8379436731338501
//   ],
//   size: 3
// }

You can then visualize the 3 predicted masks with:

const image = RawImage.fromTensor(masks[0][0].mul(255));
image.save('mask.png');
Input image Visualized output
corgi mask

Next, select the channel with the highest IoU score, which in this case is the second (green) channel. Intersecting this with the original image gives us an isolated version of the subject:

Selected Mask Intersected
mask corgi-masked
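
Programmatically, a minimal sketch for selecting the highest-scoring channel (reusing the masks and outputs variables from above) might look like this:

// Find the index of the mask channel with the highest IoU score
const scores_data = outputs.iou_scores.data; // Float32Array(3)
let best = 0;
for (let i = 1; i < scores_data.length; ++i) {
    if (scores_data[i] > scores_data[best]) best = i;
}

// Visualize only the best mask
const best_mask = RawImage.fromTensor(masks[0][best].mul(255));
best_mask.save('best_mask.png');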

🛠️ Improvements

Full Changelog: https://github.com/xenova/transformers.js/compare/2.13.4...2.14.0

transformers.js - 2.13.4

Published by xenova 10 months ago

What's new?

  • Add support for cross-encoder models (+fix token type ids) (#501)

    Example: Information Retrieval w/ Xenova/ms-marco-TinyBERT-L-2-v2.

    import { AutoTokenizer, AutoModelForSequenceClassification } from '@xenova/transformers';
    
    const model = await AutoModelForSequenceClassification.from_pretrained('Xenova/ms-marco-TinyBERT-L-2-v2');
    const tokenizer = await AutoTokenizer.from_pretrained('Xenova/ms-marco-TinyBERT-L-2-v2');
    
    const features = tokenizer(
        ['How many people live in Berlin?', 'How many people live in Berlin?'],
        {
            text_pair: [
                'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
                'New York City is famous for the Metropolitan Museum of Art.',
            ],
            padding: true,
            truncation: true,
        }
    )
    
    const { logits } = await model(features)
    console.log(logits.data);
    // quantized:   [ 7.210887908935547, -11.559350967407227 ]
    // unquantized: [ 7.235750675201416, -11.562294006347656 ]
    

    Check out the list of pre-converted models here. We also put out a demo for you to try out.

Full Changelog: https://github.com/xenova/transformers.js/compare/2.13.3...2.13.4

transformers.js - 2.13.3

Published by xenova 10 months ago

What's new?

Full Changelog: https://github.com/xenova/transformers.js/compare/2.13.2...2.13.3

transformers.js - 2.13.2

Published by xenova 10 months ago

What's new?

This release is a follow-up to #485, with additional intellisense-focused improvements (see PR).

typing-demo-new

Full Changelog: https://github.com/xenova/transformers.js/compare/2.13.1...2.13.2

transformers.js - 2.13.1

Published by xenova 10 months ago

What's new?

  • Improve typing of pipeline function in https://github.com/xenova/transformers.js/pull/485. Thanks to @wesbos for the suggestion!

    typing-demo

    This also means when you hover over the class name, you'll get example code to help you out.
    typing-demo2

  • Add phi-1_5 model in https://github.com/xenova/transformers.js/pull/493.

    import { pipeline } from '@xenova/transformers';
    
    // Create a text-generation pipeline
    const generator = await pipeline('text-generation', 'Xenova/phi-1_5_dev');
    
    // Construct prompt
    const prompt = `\`\`\`py
    import math
    def print_prime(n):
        """
        Print all primes between 1 and n
        """`;
    
    // Generate text
    const result = await generator(prompt, {
      max_new_tokens: 100,
    });
    console.log(result[0].generated_text);
    

    Results in:

    import math
    def print_prime(n):
        """
        Print all primes between 1 and n
        """
        primes = []
        for num in range(2, n+1):
            is_prime = True
            for i in range(2, int(math.sqrt(num))+1):
                if num % i == 0:
                    is_prime = False
                    break
            if is_prime:
                primes.append(num)
        print(primes)
    
    print_prime(20)
    

    Running the code produces the correct result:

    [2, 3, 5, 7, 11, 13, 17, 19]
    

Full Changelog: https://github.com/xenova/transformers.js/compare/2.13.0...2.13.1

transformers.js - 2.13.0

Published by xenova 10 months ago

What's new?

🎄 7 new architectures!

This release adds support for many new multimodal architectures, bringing the total number of supported architectures to 80! 🤯

1. VITS for multilingual text-to-speech across over 1000 languages! (https://github.com/xenova/transformers.js/pull/466)

import { pipeline } from '@xenova/transformers';

// Create English text-to-speech pipeline
const synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-eng');

// Generate speech
const output = await synthesizer('I love transformers');
// {
//   audio: Float32Array(26112) [...],
//   sampling_rate: 16000
// }

https://github.com/xenova/transformers.js/assets/26504141/63c1a315-1ad6-44a2-9a2f-6689e2d9d14e

See here for the list of available models. To start, we've converted 12 of the ~1140 models on the Hugging Face Hub. If we haven't added the one you wish to use, you can make it web-ready using our conversion script.

2. CLIPSeg for zero-shot image segmentation. (https://github.com/xenova/transformers.js/pull/478)

import { AutoTokenizer, AutoProcessor, CLIPSegForImageSegmentation, RawImage } from '@xenova/transformers';

// Load tokenizer, processor, and model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clipseg-rd64-refined');
const processor = await AutoProcessor.from_pretrained('Xenova/clipseg-rd64-refined');
const model = await CLIPSegForImageSegmentation.from_pretrained('Xenova/clipseg-rd64-refined');

// Run tokenization
const texts = ['a glass', 'something to fill', 'wood', 'a jar'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Read image and run processor
const image = await RawImage.read('https://github.com/timojl/clipseg/blob/master/example_image.jpg?raw=true');
const image_inputs = await processor(image);

// Run model with both text and pixel inputs
const { logits } = await model({ ...text_inputs, ...image_inputs });
// logits: Tensor {
//   dims: [4, 352, 352],
//   type: 'float32',
//   data: Float32Array(495616)[ ... ],
//   size: 495616
// }

You can visualize the predictions as follows:

const preds = logits
  .unsqueeze_(1)
  .sigmoid_()
  .mul_(255)
  .round_()
  .to('uint8');

for (let i = 0; i < preds.dims[0]; ++i) {
  const img = RawImage.fromTensor(preds[i]);
  img.save(`prediction_${i}.png`);
}
Original "a glass" "something to fill" "wood" "a jar"
image prediction_0 prediction_1 prediction_2 prediction_3

See here for the list of available models.

3. SegFormer for semantic segmentation and image classification. (https://github.com/xenova/transformers.js/pull/480)

import { pipeline } from '@xenova/transformers';

// Create an image segmentation pipeline
const segmenter = await pipeline('image-segmentation', 'Xenova/segformer_b2_clothes');

// Segment an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/young-man-standing-and-leaning-on-car.jpg';
const output = await segmenter(url);

image

[
  {
    score: null,
    label: 'Background',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Hair',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Upper-clothes',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Pants',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-shoe',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-shoe',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Face',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-leg',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-leg',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-arm',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-arm',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  }
]

See here for the list of available models.

4. Table Transformer for table extraction from unstructured documents. (https://github.com/xenova/transformers.js/pull/477)

import { pipeline } from '@xenova/transformers';

// Create an object detection pipeline
const detector = await pipeline('object-detection', 'Xenova/table-transformer-detection', { quantized: false });

// Detect tables in an image
const img = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice-with-table.png';
const output = await detector(img);
// [{ score: 0.9967531561851501, label: 'table', box: { xmin: 52, ymin: 322, xmax: 546, ymax: 525 } }]

image

See here for the list of available models.

5. DiT for document image classification. (https://github.com/xenova/transformers.js/pull/474)

import { pipeline } from '@xenova/transformers';

// Create an image classification pipeline
const classifier = await pipeline('image-classification', 'Xenova/dit-base-finetuned-rvlcdip');

// Classify an image 
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/coca_cola_advertisement.png';
const output = await classifier(url);
// [{ label: 'advertisement', score: 0.9035086035728455 }]

See here for the list of available models.

6. SigLIP for zero-shot image classification. (https://github.com/xenova/transformers.js/pull/473)

import { pipeline } from '@xenova/transformers';

// Create a zero-shot image classification pipeline
const classifier = await pipeline('zero-shot-image-classification', 'Xenova/siglip-base-patch16-224');

// Classify images according to provided labels
const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const output = await classifier(url, ['2 cats', '2 dogs'], {
    hypothesis_template: 'a photo of {}',
});
// [
//   { score: 0.16770583391189575, label: '2 cats' },
//   { score: 0.000022096000975579955, label: '2 dogs' }
// ]

See here for the list of available models.

7. RoFormer for masked language modelling, sequence classification, token classification, and question answering. (https://github.com/xenova/transformers.js/pull/464)

import { pipeline } from '@xenova/transformers';

// Create a masked language modelling pipeline
const pipe = await pipeline('fill-mask', 'Xenova/antiberta2');

// Predict missing token
const output = await pipe('Ḣ Q V Q ... C A [MASK] D ... T V S S');
[
  {
    score: 0.48774364590644836,
    token: 19,
    token_str: 'R',
    sequence: 'Ḣ Q V Q C A R D T V S S'
  },
  {
    score: 0.2768442928791046,
    token: 18,
    token_str: 'Q',
    sequence: 'Ḣ Q V Q C A Q D T V S S'
  },
  {
    score: 0.0890476182103157,
    token: 13,
    token_str: 'K',
    sequence: 'Ḣ Q V Q C A K D T V S S'
  },
  {
    score: 0.05106702819466591,
    token: 14,
    token_str: 'L',
    sequence: 'Ḣ Q V Q C A L D T V S S'
  },
  {
    score: 0.021606773138046265,
    token: 8,
    token_str: 'E',
    sequence: 'Ḣ Q V Q C A E D T V S S'
  }
]

See here for the list of available models.

🛠️ Misc. improvements

🤗 New Contributors

Full Changelog: https://github.com/xenova/transformers.js/compare/2.12.1...2.13.0

transformers.js - 2.12.1

Published by xenova 10 months ago

What's new?

This patch release makes @huggingface/jinja a regular dependency instead of a peer dependency. It also means apply_chat_template is now synchronous (and no longer lazily loads the module). We may reintroduce lazy loading in future, but for now it causes issues when loading from a CDN.

code

import { AutoTokenizer } from "@xenova/transformers";

// Load tokenizer from the Hugging Face Hub
const tokenizer = await AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1");

// Define chat messages
const chat = [
  { role: "user", content: "Hello, how are you?" },
  { role: "assistant", content: "I'm doing great. How can I help you today?" },
  { role: "user", content: "I'd like to show off how chat templating works!" },
]

const text = tokenizer.apply_chat_template(chat, { tokenize: false });
// "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false });
// [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, ...]

Full Changelog: https://github.com/xenova/transformers.js/compare/2.12.0...2.12.1

transformers.js - 2.12.0

Published by xenova 10 months ago

What's new?

💬 Chat templates!

This release adds support for chat templates, a highly-requested feature that enables users to convert conversations (represented as a list of chat objects) into a single tokenizable string, in the format that the model expects. As you may know, chat templates can vary greatly across model types, so it was important to design a system that: (1) supports complex chat templates, (2) is generalizable, and (3) is easy to use. So, how did we do it? 🤔

This is made possible with @huggingface/jinja, a minimalistic JavaScript implementation of the Jinja templating engine, which we created to align with how transformers handles templating. Although it was originally designed for parsing and rendering ChatML templates, we decided to separate out the templating logic into an external (optional) library due to its usefulness in other types of applications. Special thanks to @tlaceby for his amazing "Guide to Interpreters" series, which provided the basis for our implementation. 🤗
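
Since the templating logic ships as its own package, it can also be used standalone. A minimal sketch (assuming the Template class exported by @huggingface/jinja):

import { Template } from "@huggingface/jinja";

// Parse and render a simple Jinja template with a context object
const template = new Template("Hello, {{ name }}!");
const rendered = template.render({ name: "world" });
console.log(rendered); // "Hello, world!"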

Anyway, let's take a look at an example:

import { AutoTokenizer } from "@xenova/transformers";

// Load tokenizer from the Hugging Face Hub
const tokenizer = await AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1");

// Define chat messages
const chat = [
  { role: "user", content: "Hello, how are you?" },
  { role: "assistant", content: "I'm doing great. How can I help you today?" },
  { role: "user", content: "I'd like to show off how chat templating works!" },
]

const text = tokenizer.apply_chat_template(chat, { tokenize: false });
// "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

Notice how the entire chat is condensed into a single string. If you would instead like to return the tokenized version (i.e., a list of token IDs), you can use the following:

const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false });
// [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, 28737, 28742, 28719, 2548, 1598, 28723, 1602, 541, 315, 1316, 368, 3154, 28804, 2, 28705, 733, 16289, 28793, 315, 28742, 28715, 737, 298, 1347, 805, 910, 10706, 5752, 1077, 3791, 28808, 733, 28748, 16289, 28793]

For more information about chat templates, check out the transformers documentation.

🐛 Bug fixes

  • Incorrect encoding/decoding of whitespace around special characters with Fast Llama tokenizers. These bugs will also soon be fixed in the transformers library. For backwards compatibility reasons, if the tokenizer was exported with the legacy behaviour, it will still act in the same way unless explicitly set otherwise. Newer exports won't be affected. If you wish to override this default, to either still use the legacy behaviour (for backwards compatibility reasons), or to upgrade to the fixed version, you can do so with:

    // Use the default behaviour (specified in tokenizer_config.json, which in this case is `{legacy: false}`).
    const tokenizer = await AutoTokenizer.from_pretrained('Xenova/llama2-tokenizer');
    const { input_ids } = tokenizer('<s>\n', { add_special_tokens: false, return_tensor: false });
    console.log(input_ids); // [1, 13]
    
    // Use the legacy behaviour
    const tokenizer = await AutoTokenizer.from_pretrained('Xenova/llama2-tokenizer', { legacy: true });
    const { input_ids } = tokenizer('<s>\n', { add_special_tokens: false, return_tensor: false });
    console.log(input_ids); // [1, 29871, 13]
    
  • Strip whitespace around special tokens for wav2vec tokenizers.

🔨 Improvements

  • More comprehensive tokenizer test suite: including both static and dynamic tokenizer tests for encoding, decoding, and chat templates.

Full Changelog: https://github.com/xenova/transformers.js/compare/2.11.0...2.12.0

transformers.js - 2.11.0

Published by xenova 10 months ago

What's new?

🤯 8 new architectures!

This release adds support for a bunch of new model architectures, covering a wide range of use cases! In total, we now support 73 different model architectures!

1. ViTMatte for image matting (https://github.com/xenova/transformers.js/pull/448). See here for the list of available models.

Example: Image matting w/ Xenova/vitmatte-small-distinctions-646.

import { AutoProcessor, VitMatteForImageMatting, RawImage } from '@xenova/transformers';

// Load processor and model
const processor = await AutoProcessor.from_pretrained('Xenova/vitmatte-small-distinctions-646');
const model = await VitMatteForImageMatting.from_pretrained('Xenova/vitmatte-small-distinctions-646');

// Load image and trimap
const image = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_image.png');
const trimap = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_trimap.png');

// Prepare image + trimap for the model
const inputs = await processor(image, trimap);

// Predict alpha matte
const { alphas } = await model(inputs);
// Tensor {
//   dims: [ 1, 1, 640, 960 ],
//   type: 'float32',
//   size: 614400,
//   data: Float32Array(614400) [ 0.9894027709960938, 0.9970508813858032, ... ]
// }

You can then visualize the predicted alpha matte as follows:

import { Tensor, cat } from '@xenova/transformers';

// Visualize predicted alpha matte
const imageTensor = new Tensor(
  'uint8',
  new Uint8Array(image.data),
  [image.height, image.width, image.channels]
).transpose(2, 0, 1);

// Convert float (0-1) alpha matte to uint8 (0-255)
const alphaChannel = alphas
  .squeeze(0)
  .mul_(255)
  .clamp_(0, 255)
  .round_()
  .to('uint8');

// Concatenate original image with predicted alpha
const imageData = cat([imageTensor, alphaChannel], 0);

// Save output image
const outputImage = RawImage.fromTensor(imageData);
outputImage.save('output.png');

Inputs:

Image Trimap
vitmatte_image vitmatte_trimap

Outputs:

Quantized Unquantized
output_quantized output_unquantized

2. ESM for protein sequence feature-extraction, masked language modelling, token classification, and zero-shot classification (https://github.com/xenova/transformers.js/pull/447). See here for the list of available models.

Example: Protein sequence classification w/ Xenova/esm2_t6_8M_UR50D_sequence_classifier_v1.

import { pipeline } from '@xenova/transformers';

// Create text classification pipeline
const classifier = await pipeline('text-classification', 'Xenova/esm2_t6_8M_UR50D_sequence_classifier_v1');

// Suppose these are your new sequences that you want to classify
// Additional Family 0: Enzymes
const new_sequences_0 = [ 'ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK', 'GVALDECKALDYLPGKPLPMDGKVCQCGSKTPLRP', 'VLPGYTCGELDCKPGKPLPKCGADKTQVATPFLRG', 'TCGALVQYPSCADPPVLRGSDSSVKACKKLDPQDK', 'GALCEECKLCPGADYKPMDGDRLPAAATSKTRPVG', 'PAVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYG', 'VLGYTCGALDCKPGKPLPKCGADKTQVATPFLRGA', 'CGALVQYPSCADPPVLRGSDSSVKACKKLDPQDKT', 'ALCEECKLCPGADYKPMDGDRLPAAATSKTRPVGK', 'AVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYGR' ]

// Additional Family 1: Receptor Proteins
const new_sequences_1 = [ 'VGQRFYGGRQKNRHCELSPLPSACRGSVQGALYTD', 'KDQVLTVPTYACRCCPKMDSKGRVPSTLRVKSARS', 'PLAGVACGRGLDYRCPRKMVPGDLQVTPATQRPYG', 'CGVRLGYPGCADVPLRGRSSFAPRACMKKDPRVTR', 'RKGVAYLYECRKLRCRADYKPRGMDGRRLPKASTT', 'RPTGAVNCKQAKVYRGLPLPMMGKVPRVCRSRRPY', 'RLDGGYTCGQALDCKPGRKPPKMGCADLKSTVATP', 'LGTCRKLVRYPQCADPPVMGRSSFRPKACCRQDPV', 'RVGYAMCSPKLCSCRADYKPPMGDGDRLPKAATSK', 'QPKAVNCRKAMVYRPKPLPMDKGVPVCRSKRPRPY' ]

// Additional Family 2: Structural Proteins
const new_sequences_2 = [ 'VGKGFRYGSSQKRYLHCQKSALPPSCRRGKGQGSAT', 'KDPTVMTVGTYSCQCPKQDSRGSVQPTSRVKTSRSK', 'PLVGKACGRSSDYKCPGQMVSGGSKQTPASQRPSYD', 'CGKKLVGYPSSKADVPLQGRSSFSPKACKKDPQMTS', 'RKGVASLYCSSKLSCKAQYSKGMSDGRSPKASSTTS', 'RPKSAASCEQAKSYRSLSLPSMKGKVPSKCSRSKRP', 'RSDVSYTSCSQSKDCKPSKPPKMSGSKDSSTVATPS', 'LSTCSKKVAYPSSKADPPSSGRSSFSMKACKKQDPPV', 'RVGSASSEPKSSCSVQSYSKPSMSGDSSPKASSTSK', 'QPSASNCEKMSSYRPSLPSMSKGVPSSRSKSSPPYQ' ]

// Merge all sequences
const new_sequences = [...new_sequences_0, ...new_sequences_1, ...new_sequences_2];

// Get the predicted class for each sequence
const predictions = await classifier(new_sequences);

// Output the predicted class for each sequence
for (let i = 0; i < predictions.length; ++i) {
    console.log(`Sequence: ${new_sequences[i]}, Predicted class: '${predictions[i].label}'`)
}
// Sequence: ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK, Predicted class: 'Enzymes'
// ... (truncated)
// Sequence: AVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYGR, Predicted class: 'Enzymes'
// Sequence: VGQRFYGGRQKNRHCELSPLPSACRGSVQGALYTD, Predicted class: 'Receptor Proteins'
// ... (truncated)
// Sequence: QPKAVNCRKAMVYRPKPLPMDKGVPVCRSKRPRPY, Predicted class: 'Receptor Proteins'
// Sequence: VGKGFRYGSSQKRYLHCQKSALPPSCRRGKGQGSAT, Predicted class: 'Structural Proteins'
// ... (truncated)
// Sequence: QPSASNCEKMSSYRPSLPSMSKGVPSSRSKSSPPYQ, Predicted class: 'Structural Proteins'

3. Hubert for audio classification, and automatic speech recognition (https://github.com/xenova/transformers.js/pull/449). See here for the list of available models.

Example: Speech command recognition w/ Xenova/hubert-base-superb-ks.

import { pipeline } from '@xenova/transformers';

// Create audio classification pipeline
const classifier = await pipeline('audio-classification', 'Xenova/hubert-base-superb-ks');

// Classify audio
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speech-commands_down.wav';
const output = await classifier(url, { topk: 5 });
// [
//   { label: 'down', score: 0.9954305291175842 },
//   { label: 'go', score: 0.004518700763583183 },
//   { label: '_unknown_', score: 0.00005029444946558215 },
//   { label: 'no', score: 4.877569494965428e-7 },
//   { label: 'stop', score: 5.504634081887616e-9 }
// ]

Example: Perform automatic speech recognition w/ Xenova/hubert-large-ls960-ft.

import { pipeline } from '@xenova/transformers';

// Create automatic speech recognition pipeline
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/hubert-large-ls960-ft');

// Transcribe audio
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
// { text: 'AND SO MY FELLOW AMERICA ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY' }

4. Chinese-CLIP for zero-shot image classification (https://github.com/xenova/transformers.js/pull/455). See here for the list of available models.

Example: Zero-shot image classification w/ Xenova/chinese-clip-vit-base-patch16.

import { pipeline } from '@xenova/transformers';

// Create zero-shot image classification pipeline
const classifier = await pipeline('zero-shot-image-classification', 'Xenova/chinese-clip-vit-base-patch16');

// Set image url and candidate labels
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/pikachu.png';
const candidate_labels = ['杰尼龟', '妙蛙种子', '小火龙', '皮卡丘'] // Squirtle, Bulbasaur, Charmander, Pikachu in Chinese

// Classify image
const output = await classifier(url, candidate_labels);
console.log(output);
// [
//   { score: 0.9926728010177612, label: '皮卡丘' },      // Pikachu
//   { score: 0.003480620216578245, label: '妙蛙种子' },   // Bulbasaur
//   { score: 0.001942147733643651, label: '杰尼龟' },     // Squirtle
//   { score: 0.0019044597866013646, label: '小火龙' }     // Charmander
// ]

5. DINOv2 for image classification (https://github.com/xenova/transformers.js/pull/444). See here for the list of available models.

Example: Image classification w/ Xenova/dinov2-small-imagenet1k-1-layer.

import { pipeline} from '@xenova/transformers';

// Create image classification pipeline
const classifier = await pipeline('image-classification', 'Xenova/dinov2-small-imagenet1k-1-layer');

// Classify an image
const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const output = await classifier(url);
console.log(output)
// [{ label: 'tabby, tabby cat', score: 0.8088238835334778 }]

6. ConvBERT for feature extraction (https://github.com/xenova/transformers.js/pull/445). See here for the list of available models.

Example: Feature extraction w/ Xenova/conv-bert-small.

import { pipeline } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/conv-bert-small');

// Perform feature extraction
const output = await extractor('This is a test sentence.');
console.log(output)
// Tensor {
//   dims: [ 1, 8, 256 ],
//   type: 'float32',
//   data: Float32Array(2048) [ -0.09434918314218521, 0.5715903043746948, ... ],
//   size: 2048
// }

7. ELECTRA for feature extraction (https://github.com/xenova/transformers.js/pull/446). See here for the list of available models.

Example: Feature extraction w/ Xenova/electra-small-discriminator.

import { pipeline } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/electra-small-discriminator');

// Perform feature extraction
const output = await extractor('This is a test sentence.');
console.log(output)
// Tensor {
//   dims: [ 1, 8, 256 ],
//   type: 'float32',
//   data: Float32Array(2048) [ 0.5410046577453613, 0.18386700749397278, ... ],
//   size: 2048
// }

8. Phi for text generation (https://github.com/xenova/transformers.js/pull/443).

NOTE: This only adds support for the architecture. When the external data format is supported in ONNX Runtime, we will make an update that includes converted versions of the available Phi models.

🕹️ New example: Semantic Music Search application

In the last release, we added support for CLAP models (CLIP but for audio), so in this one, we're releasing a simple demo application which shows how you can use a CLAP model to perform real-time semantic music search! For simplicity, we implemented everything in vanilla JavaScript, but feel free to adapt it to your framework of choice! As always, the source code is open source! 🥳 PR: https://github.com/xenova/transformers.js/pull/442

Demo video:

https://github.com/xenova/transformers.js/assets/26504141/72e09f8c-d6e9-4430-a56c-7994737966db

🐛 Bug fixes

🛠️ Other features

📄 Documentation

Full Changelog: https://github.com/xenova/transformers.js/compare/2.10.1...2.11.0

transformers.js - 2.10.1

Published by xenova 11 months ago

What's new?

🐛 Bug fixes

🛠️ Misc. improvements

Full Changelog: https://github.com/xenova/transformers.js/compare/2.10.0...2.10.1

transformers.js - 2.10.0

Published by xenova 11 months ago

What's new?

🎵 New task: Zero-shot audio classification

Zero-shot audio classification is the task of classifying audio into classes that are unseen during training. See here for more information.

Example: Perform zero-shot audio classification with Xenova/clap-htsat-unfused.

import { pipeline } from '@xenova/transformers';

// Create a zero-shot audio classification pipeline
const classifier = await pipeline('zero-shot-audio-classification', 'Xenova/clap-htsat-unfused');

const audio = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/dog_barking.wav';
const candidate_labels = ['dog', 'vacuum cleaner'];
const scores = await classifier(audio, candidate_labels);
// [
//   { score: 0.9993992447853088, label: 'dog' },
//   { score: 0.0006007603369653225, label: 'vacuum cleaner' }
// ]

dog_barking.webm

💻 New architectures: CLAP, Audio Spectrogram Transformer, ConvNeXT, and ConvNeXT-v2

We added support for 4 new architectures, bringing the total up to 65!

  1. CLAP for zero-shot audio classification, text embeddings, and audio embeddings (https://github.com/xenova/transformers.js/pull/427). See here for the list of available models.

    • Zero-shot audio classification (same as above)

    • Text embeddings with Xenova/clap-htsat-unfused:

      import { AutoTokenizer, ClapTextModelWithProjection } from '@xenova/transformers';
      
      // Load tokenizer and text model
      const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clap-htsat-unfused');
      const text_model = await ClapTextModelWithProjection.from_pretrained('Xenova/clap-htsat-unfused');
      
      // Run tokenization
      const texts = ['a sound of a cat', 'a sound of a dog'];
      const text_inputs = tokenizer(texts, { padding: true, truncation: true });
      
      // Compute embeddings
      const { text_embeds } = await text_model(text_inputs);
      // Tensor {
      //   dims: [ 2, 512 ],
      //   type: 'float32',
      //   data: Float32Array(1024) [ ... ],
      //   size: 1024
      // }
      
    • Audio embeddings with Xenova/clap-htsat-unfused:

      import { AutoProcessor, ClapAudioModelWithProjection, read_audio } from '@xenova/transformers';
      
      // Load processor and audio model
      const processor = await AutoProcessor.from_pretrained('Xenova/clap-htsat-unfused');
      const audio_model = await ClapAudioModelWithProjection.from_pretrained('Xenova/clap-htsat-unfused');
      
      // Read audio and run processor
      const audio = await read_audio('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cat_meow.wav');
      const audio_inputs = await processor(audio);
      
      // Compute embeddings
      const { audio_embeds } = await audio_model(audio_inputs);
      // Tensor {
      //   dims: [ 1, 512 ],
      //   type: 'float32',
      //   data: Float32Array(512) [ ... ],
      //   size: 512
      // }
      
  2. Audio Spectrogram Transformer for audio classification (https://github.com/xenova/transformers.js/pull/427). See here for the list of available models.

    import { pipeline } from '@xenova/transformers';
    
    // Create an audio classification pipeline
    const classifier = await pipeline('audio-classification', 'Xenova/ast-finetuned-audioset-10-10-0.4593');
    
    // Predict class
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cat_meow.wav';
    const output = await classifier(url, { topk: 4 });
    // [
    //   { label: 'Meow', score: 0.5617874264717102 },
    //   { label: 'Cat', score: 0.22365376353263855 },
    //   { label: 'Domestic animals, pets', score: 0.1141069084405899 },
    //   { label: 'Animal', score: 0.08985692262649536 },
    // ]
    
  3. ConvNeXT for image classification (https://github.com/xenova/transformers.js/pull/428). See here for the list of available models.

    import { pipeline } from '@xenova/transformers';
    
    // Create image classification pipeline
    const classifier = await pipeline('image-classification', 'Xenova/convnext-tiny-224');
    
    // Classify an image
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg';
    const output = await classifier(url);
    // [{ label: 'tiger, Panthera tigris', score: 0.6153212785720825 }]
    
  4. ConvNeXT-v2 for image classification (https://github.com/xenova/transformers.js/pull/428). See here for the list of available models.

    import { pipeline } from '@xenova/transformers';
    
    // Create image classification pipeline
    const classifier = await pipeline('image-classification', 'Xenova/convnextv2-atto-1k-224');
    
    // Classify an image
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg';
    const output = await classifier(url);
    // [{ label: 'tiger, Panthera tigris', score: 0.6391205191612244 }]
    

🔨 Other improvements

Full Changelog: https://github.com/xenova/transformers.js/compare/2.9.0...2.10.0
