spacy-huggingface-pipelines: Use pretrained transformer models for text and token classification

This package provides spaCy components to use pretrained Hugging Face Transformers pipelines for inference only.

Features

Apply pretrained transformers models like dslim/bert-base-NER and distilbert-base-uncased-finetuned-sst-2-english.

🚀 Installation

Installing the package from pip will automatically install all dependencies, including PyTorch and spaCy.

pip install -U pip setuptools wheel
pip install spacy-huggingface-pipelines

For GPU installation, follow the spaCy installation quickstart with GPU, e.g.

pip install -U spacy[cuda12x]

If you are having trouble installing PyTorch, follow the instructions on the official website for your specific operating system and requirements.

📖 Documentation

This module provides spaCy wrappers for the inference-only transformers TokenClassificationPipeline and TextClassificationPipeline pipelines.

The models are downloaded on initialization from the Hugging Face Hub if they're not already in your local cache, or alternatively they can be loaded from a local path.

Note that the transformer model data is not saved with the pipeline when you call nlp.to_disk, so if you are loading pipelines in an environment with limited internet access, make sure the model is available in your transformers cache directory and enable offline mode if needed.
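
One way to prepare for an offline environment is sketched below, assuming the model has already been downloaded into the local Hugging Face cache. HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE are the environment variables documented by Hugging Face. The helper name enable_hf_offline_mode and the pipeline path in the comment are hypothetical.

```python
import os


def enable_hf_offline_mode() -> None:
    """Ask the Hugging Face libraries to use only the local cache.

    Call this before importing spacy or transformers so the setting is
    already in place when those libraries are first loaded.
    """
    os.environ["HF_HUB_OFFLINE"] = "1"        # read by huggingface_hub
    os.environ["TRANSFORMERS_OFFLINE"] = "1"  # read by transformers


enable_hf_offline_mode()

# After this point, loading the pipeline as usual should resolve the model
# from the cache instead of contacting the Hub, for example:
#   import spacy
#   nlp = spacy.load("/path/to/saved/pipeline")  # hypothetical path
```

Setting the same variables in the shell before starting Python has the same effect and avoids any dependence on import order.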

Token classification

Config settings for hf_token_pipe:

[components.hf_token_pipe]
factory = "hf_token_pipe"
model = "dslim/bert-base-NER"     # Model name or path
revision = "main"                 # Model revision
aggregation_strategy = "average"  # "simple", "first", "average", "max"
stride = 16                       # If stride >= 0, process long texts in
                                  # overlapping windows of the model max
                                  # length. The value is the length of the
                                  # window overlap in transformer tokenizer
                                  # tokens, NOT the length of the stride.
kwargs = {}                       # Any additional arguments for
                                  # TokenClassificationPipeline
alignment_mode = "strict"         # "strict", "contract", "expand"
annotate = "ents"                 # "ents", "pos", "spans", "tag"
annotate_spans_key = null         # Doc.spans key for annotate = "spans"
scorer = null                     # Optional scorer
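
The stride comment above may be easier to picture with numbers. The following is a simplified illustration, not the tokenizer's actual implementation: it ignores special tokens, and both the helper window_spans and the maximum length of 16 are made up for the example. It shows which token ranges each window would cover when stride is read as the overlap between windows.

```python
def window_spans(n_tokens: int, max_length: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) token ranges for overlapping windows.

    `overlap` plays the role of the `stride` setting: it is the number of
    tokens shared by two consecutive windows, so each new window starts
    `max_length - overlap` tokens after the previous one.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    if n_tokens == 0:
        return []
    step = max_length - overlap
    spans = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return spans


# 40 tokens, windows of 16 tokens, 4 tokens shared between neighbours
print(window_spans(40, 16, 4))
# [(0, 16), (12, 28), (24, 40)]
```

With an overlap of 4, tokens 12 to 15 appear in both the first and the second window, which gives the model context on both sides of each window boundary.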

TokenClassificationPipeline settings

model: The model name or path.
revision: The model revision. For production use, a specific git commit is recommended instead of the default main.
stride: For stride >= 0, the text is processed in overlapping windows where the stride setting specifies the number of overlapping tokens between windows (NOT the stride length). If stride is None, then the text may be truncated. stride is only supported for fast tokenizers.
aggregation_strategy: The aggregation strategy determines the word-level tags for cases where subwords within one word do not receive the same predicted tag. See: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy
kwargs: Any additional arguments to TokenClassificationPipeline.
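
To make the aggregation_strategy setting more concrete, here is a toy sketch of how "first", "average" and "max" could resolve a word whose subwords disagree. It follows the descriptions in the Hugging Face documentation but is not the library's code. The helper aggregate_word and the scores are invented for illustration, and "simple" is left out because it does not force a single tag per word.

```python
from collections import defaultdict


def aggregate_word(subword_scores: list[dict[str, float]], strategy: str) -> str:
    """Pick one label for a word from per-subword label scores."""
    if strategy == "first":
        # Use the best label of the first subword only.
        chosen = subword_scores[0]
        return max(chosen, key=chosen.get)
    if strategy == "max":
        # Use the best label of the subword with the single highest score.
        chosen = max(subword_scores, key=lambda scores: max(scores.values()))
        return max(chosen, key=chosen.get)
    if strategy == "average":
        # Average each label's score over all subwords, then take the best.
        totals: dict[str, float] = defaultdict(float)
        for scores in subword_scores:
            for label, score in scores.items():
                totals[label] += score
        means = {label: total / len(subword_scores) for label, total in totals.items()}
        return max(means, key=means.get)
    raise ValueError(f"unsupported strategy: {strategy!r}")


# Invented scores for one word that was split into three subwords
word = [
    {"PER": 0.50, "LOC": 0.40, "O": 0.10},
    {"PER": 0.10, "LOC": 0.60, "O": 0.30},
    {"PER": 0.05, "LOC": 0.25, "O": 0.70},
]
for strategy in ("first", "average", "max"):
    print(strategy, aggregate_word(word, strategy))
# first PER
# average LOC
# max O
```

The scores were chosen so that the three strategies disagree. On real model output they usually agree, and the choice matters mostly for rare or heavily split words.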

spaCy settings

alignment_mode determines how transformer predictions are aligned to spaCy token boundaries as described for Doc.char_span.
annotate and annotate_spans_key configure how the annotation is saved to the spaCy doc. You can save the output as token.tag_, token.pos_ (only for UPOS tags), doc.ents or doc.spans.
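
The effect of alignment_mode can be sketched without spaCy. The following is a simplified model of the three modes, not spaCy's implementation: align_span is a hypothetical helper that works on plain character offsets and ignores edge cases such as whitespace handling.

```python
def align_span(token_offsets, start, end, mode="strict"):
    """Snap a predicted character span to token boundaries.

    token_offsets: list of (start_char, end_char) pairs, one per token.
    Returns the aligned (start_char, end_char) or None if nothing fits.
    """
    starts = [s for s, _ in token_offsets]
    ends = [e for _, e in token_offsets]
    if mode == "strict":
        # Accept the span only if it already matches token boundaries.
        return (start, end) if start in starts and end in ends and start < end else None
    if mode == "contract":
        # Shrink to the tokens that lie completely inside the span.
        new_start = min((s for s in starts if s >= start), default=None)
        new_end = max((e for e in ends if e <= end), default=None)
    elif mode == "expand":
        # Grow to cover every token the span touches.
        new_start = max((s for s in starts if s <= start), default=None)
        new_end = min((e for e in ends if e >= end), default=None)
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    if new_start is None or new_end is None or new_start >= new_end:
        return None
    return (new_start, new_end)


# "I live in London" -> I (0, 1), live (2, 6), in (7, 9), London (10, 16)
tokens = [(0, 1), (2, 6), (7, 9), (10, 16)]

# A prediction that covers only "Londo" (characters 10 to 15)
print(align_span(tokens, 10, 15, "strict"))    # None
print(align_span(tokens, 10, 15, "contract"))  # None
print(align_span(tokens, 10, 15, "expand"))    # (10, 16)
```

With the default "strict", a prediction that cuts through a spaCy token is dropped. "expand" keeps it by widening the span to the full token.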

Examples

Save named entity annotation as Doc.ents:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe("hf_token_pipe", config={"model": "dslim/bert-base-NER"})
doc = nlp("My name is Sarah and I live in London")
print(doc.ents)
# (Sarah, London)

Save named entity annotation as Doc.spans[spans_key] and scores as
Doc.spans[spans_key].attrs["scores"]:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "annotate": "spans",
        "annotate_spans_key": "bert-base-ner",
    },
)
doc = nlp("My name is Sarah and I live in London")
print(doc.spans["bert-base-ner"])
# [Sarah, London]
print(doc.spans["bert-base-ner"].attrs["scores"])
# [0.99854773, 0.9996215]
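
Because the scores are stored in the same order as the spans, the two can be paired up, for example to keep only confident predictions. The sketch below uses plain lists as stand-ins for the span group and its scores. The helper spans_above_threshold and the threshold value are hypothetical.

```python
def spans_above_threshold(spans, scores, threshold=0.9):
    """Pair each span with its score and keep those at or above the threshold."""
    return [
        (span, float(score))
        for span, score in zip(spans, scores)
        if score >= threshold
    ]


# Stand-ins for doc.spans["bert-base-ner"] and its attrs["scores"]
spans = ["Sarah", "London"]
scores = [0.99854773, 0.9996215]

print(spans_above_threshold(spans, scores, threshold=0.999))
# [('London', 0.9996215)]
```

With a real doc, the same call would take doc.spans["bert-base-ner"] and doc.spans["bert-base-ner"].attrs["scores"] as its first two arguments.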

Save fine-grained tags as Token.tag_:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "QCRI/bert-base-multilingual-cased-pos-english",
        "annotate": "tag",
    },
)
doc = nlp("My name is Sarah and I live in London")
print([t.tag_ for t in doc])
# ['PRP$', 'NN', 'VBZ', 'NNP', 'CC', 'PRP', 'VBP', 'IN', 'NNP']

Save coarse-grained tags as Token.pos_:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={"model": "vblagoje/bert-english-uncased-finetuned-pos", "annotate": "pos"},
)
doc = nlp("My name is Sarah and I live in London")
print([t.pos_ for t in doc])
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']

Text classification

Config settings for hf_text_pipe:

[components.hf_text_pipe]
factory = "hf_text_pipe"
model = "distilbert-base-uncased-finetuned-sst-2-english"  # Model name or path
revision = "main"                 # Model revision
kwargs = {}                       # Any additional arguments for
                                  # TextClassificationPipeline
scorer = null                     # Optional scorer

The input texts are truncated according to the transformers model max length.

TextClassificationPipeline settings

model: The model name or path.
revision: The model revision. For production use, a specific git commit is recommended instead of the default main.
kwargs: Any additional arguments to TextClassificationPipeline.

Example

import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_text_pipe",
    config={"model": "distilbert-base-uncased-finetuned-sst-2-english"},
)
doc = nlp("This is great!")
print(doc.cats)
# {'POSITIVE': 0.9998694658279419, 'NEGATIVE': 0.00013048505934420973}
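
Since doc.cats is a plain dictionary of label to score, picking the predicted label is a one-liner. A minimal sketch, using the output above as stand-in data and a hypothetical helper name top_category:

```python
def top_category(cats):
    """Return (label, score) for the highest-scoring category, or None if empty."""
    if not cats:
        return None
    label = max(cats, key=cats.get)
    return label, cats[label]


# Stand-in for doc.cats from the example above
cats = {"POSITIVE": 0.9998694658279419, "NEGATIVE": 0.00013048505934420973}

print(top_category(cats))
# ('POSITIVE', 0.9998694658279419)
```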

Batching and GPU

Both token and text classification support batching with nlp.pipe:

# texts is any iterable of strings; do_something stands in for your own code
for doc in nlp.pipe(texts, batch_size=256):
    do_something(doc)

If the component runs into an error processing a batch (e.g. on an empty text), nlp.pipe will back off to processing each text individually. If it runs into an error on an individual text, a warning is shown and the doc is returned without additional annotation.
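
The back-off behaviour described above follows a common pattern, sketched here in plain Python. This is an illustration of the pattern only, not the package's code: process_with_backoff and toy_classifier are hypothetical names, and None stands in for a doc that is returned without additional annotation.

```python
import warnings


def process_with_backoff(process_batch, batch):
    """Try the whole batch first; on failure, retry each item on its own."""
    try:
        return process_batch(batch)
    except Exception:
        pass  # fall through to item-by-item processing
    results = []
    for item in batch:
        try:
            results.extend(process_batch([item]))
        except Exception as err:
            # Warn and keep a placeholder so the output stays aligned with the input.
            warnings.warn(f"Skipping annotation for {item!r}: {err}")
            results.append(None)
    return results


def toy_classifier(texts):
    """Stand-in for a model call that fails on empty input."""
    if any(not text for text in texts):
        raise ValueError("empty text")
    return [len(text) for text in texts]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = process_with_backoff(toy_classifier, ["good", "", "fine"])

print(results)
# [4, None, 4]
print(len(caught))
# 1
```

The trade-off is that one bad text slows down its whole batch, because every item in that batch is then processed individually, but no batch is lost entirely.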

Switch to GPU:

import spacy
spacy.require_gpu()

for doc in nlp.pipe(texts):
    do_something(doc)

Bug reports and issues

Please report bugs in the spaCy issue tracker or open a new thread on the discussion board for other issues.
