💫 Industrial-strength Natural Language Processing (NLP) in Python
MIT License
Published by adrianeboyd almost 2 years ago
EntityLinker
.Doc.to_json()
for attributes set by getters.pipeline_package.load()
.spacy project
requirements checks for unsupported specifiers and requirements lines.spacy.load(disable=)
that could enable currently disabled components.
@aaronzipp, @adrianeboyd, @honnibal, @ines, @polm, @rmitsch, @ryndaniels, @svlandeg, @thomashacker
Published by adrianeboyd almost 2 years ago
spacy.ConsoleLogger.v2
optionally saves training logs to JSONL (#11214).DependencyMatcher
to include matching parents or children to the left or the right of the node (#10371).cuda11x
and cuda-autodetect
(using cupy-wheel
) (#11279).Doc.to_json()
and Doc.from_json()
(#11125).enable
and disable
options for spacy.load()
more consistent (#11459).disable
/enable
/exclude
for spacy.load()
(#11406).--url
flag for spacy info
to print the direct download URL for a pipeline (#11175).spacy project
CLI (#11226).spacy debug data
CLI for spancat data (#11504).spacy_version
in spacy package
metadata (#11552).spacy project assets
(#11458).spacy pretrain
command (#11210).natto-py
for the ko
extra (#11222).
This release includes updated English pipelines for spaCy v3.4 with improved NER performance. The updates in en_core_web_* v3.4.1 address issues related to training from data with partial named entity annotation, which led to lower NER recall in English pipeline versions v3.0.0–v3.4.0. In particular, entities that appear in the sections of the OntoNotes training data without NER annotation were not predicted consistently by the earlier pipeline versions, such as names and places that are frequent in the Biblical sections, e.g., "David" and "Egypt" (see #7493).
Use spacy download to update your English pipelines to the newest version. If you'd prefer to keep using an earlier version, you can specify the version directly, e.g. spacy download -d en_core_web_sm-3.4.0. You can check that you are using the new version (v3.4.1) with spacy validate:
NAME             SPACY            VERSION
en_core_web_md   >=3.4.0,<3.5.0   3.4.1     ✔
SetPredicate
.Doc.__init__
.pymorphy2_lookup
lemmatizer mode for Russian and Ukrainian.Doc
type, an error will now be raised (#11424).spacy.models_and_pipes_with_nvtx_range.v1
callback.Example
API documentation.displacy
docs.spacy project dvc
.spacy-wordnet
.initialize()
function for pipeline components.
@adrianeboyd, @bdura, @danieldk, @diyclassics, @DSLituiev, @GabrielePicco, @honnibal, @ines, @JulesBelveze, @kadarakos, @ljvmiranda921, @ninjalu, @pmbaumgartner, @polm, @radandreicristian, @richardpaulhudson, @rmitsch, @shadeMe, @stefawolf, @svlandeg, @thomashacker, @tobiusaolo, @tzussman, @yasufumy
Published by adrianeboyd almost 2 years ago
@adrianeboyd, @honnibal, @ines
Published by adrianeboyd about 2 years ago
@adrianeboyd, @danieldk, @honnibal, @ines, @lll-lll-lll-lll, @Lucaterre, @MaartenGr, @mr-bjerre, @polm, @radenkovic
Published by adrianeboyd over 2 years ago
{n,m}
operator for Matcher
patterns (#10981).saxpy
/sgemm
provided by the Ops
implementation in order to use Accelerate through thinc-apple-ops
(#10773).Example.get_aligned_parse
and Example.get_aligned
(#10952).StringStore
lookups (#10938).spacy project clone
to try both main
and master
branches by default (#10843).init_config_cli
(#10788).debug data
(#10960).TrainablePipe
components (#10965).SPACY_NUM_BUILD_JOBS
to specify the number of build jobs to run in parallel with pip
(#11073).
We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.
Package | UPOS | Parser LAS | NER F |
---|---|---|---|
hr_core_news_sm | 96.6 | 77.5 | 76.1 |
hr_core_news_md | 97.3 | 80.1 | 81.8 |
hr_core_news_lg | 97.5 | 80.4 | 83.0 |
🙏 Special thanks to @gtoffoli for help with the new pipelines!
The English pipelines have new word vectors:
Package | Model Version | TAG | Parser LAS | NER F |
---|---|---|---|---|
en_core_web_md | v3.3.0 | 97.3 | 90.1 | 84.6 |
en_core_web_md | v3.4.0 | 97.2 | 90.3 | 85.5 |
en_core_web_lg | v3.3.0 | 97.4 | 90.1 | 85.3 |
en_core_web_lg | v3.4.0 | 97.3 | 90.2 | 85.6 |
All CNN pipelines have been extended to add whitespace augmentation.
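The {n,m} operator for Matcher patterns listed for this release can be tried on a blank pipeline. The following is a minimal sketch, assuming spaCy v3.4 or later is installed; the pattern name VERY_GOOD and the example sentences are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline only tokenizes, so no trained package is needed.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "very" repeated two or three times, followed by "good".
# "VERY_GOOD" is a hypothetical pattern name chosen for this example.
pattern = [{"LOWER": "very", "OP": "{2,3}"}, {"LOWER": "good"}]

# greedy="LONGEST" keeps only the longest of several overlapping matches.
matcher.add("VERY_GOOD", [pattern], greedy="LONGEST")

doc = nlp("This is very very very good")
matches = [doc[start:end].text for _, start, end in matcher(doc)]

# A single "very" is below the lower bound of the quantifier.
short_doc = nlp("This is very good")
no_matches = [short_doc[start:end].text for _, start, end in matcher(short_doc)]

print(matches)
print(no_matches)
```

Without the greedy setting, the matcher would also report the shorter overlapping span, so the filter is a choice made for this sketch and not a requirement of the operator.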
Doc.has_vector
, distinguish 0-vectors and missing vectors in similarity
warnings.get_array_module
in textcat
.Doc.has_vector
now matches Token.has_vector
and Span.has_vector
: it returns True
if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.
@adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere
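The Doc.has_vector behavior described in this release can be checked on a blank pipeline. This is a minimal sketch, not taken from the release notes: the word "apple", the 4-dimensional vector and the example texts are assumptions made for illustration.

```python
import numpy
import spacy

# Blank pipeline: the vocab starts without any vectors.
nlp = spacy.blank("en")

# Add a single (made-up) 4-dimensional vector for the word "apple".
nlp.vocab.set_vector("apple", numpy.ones((4,), dtype="float32"))

doc_with = nlp("apple pie")
token_flags = [token.has_vector for token in doc_with]
has_vec = doc_with.has_vector

# None of these tokens has a vector. According to the note above, from this
# release on Doc.has_vector reflects that instead of only checking whether
# the vocab contains vectors at all.
doc_without = nlp("banana bread")

print(token_flags, has_vec)
print(doc_without.has_vector)
```

The first doc contains one token with a vector, so Doc.has_vector agrees with Token.has_vector for that token; the second doc is the case where the result depends on the spaCy version.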
Published by danieldk over 2 years ago
Doc.spans[spans_key]
.Doc
objects.debug data
.Doc
objects.SpanGroup
objects that share the same name within one SpanGroups
container.walk_head_nodes
to avoid acquiring the GIL.StringStore.__getitem__
return type dependent on its parameter type.PhraseMatcher
.SpanGroups.setdefault
to also support Iterable[SpanGroup]
as the default.ROOT
is in the glossary.Doc.has_annotation
and Matcher
.Doc
inputs passed to Language.pipe()
.Doc
.
Before this release, a validation bug allowed the configuration of a pipeline component to override the name of the component itself through the name attribute. For example, the following pipeline component:
[components.transformer]
factory = "transformer"
name = "custom_transformer_name"
would be registered erroneously as custom_transformer_name. Such overrides are now ignored and a warning is emitted (#10779). From spaCy v3.3.1 onwards, this component will be registered as transformer.
@adrianeboyd, @danieldk, @freddyheppell, @honnibal, @ines, @kadarakos, @ldorigo, @ljvmiranda921, @maxTarlov, @pmbaumgartner, @polm, @pypae, @richardpaulhudson, @rmitsch, @shadeMe, @single-fingal, @svlandeg
Published by adrianeboyd over 2 years ago
spacy.Tagger.v2
to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).Ragged
with faster AlignmentArray
in Example
for training (#10319).Matcher
speed (#10659).Doc.spans
(#10250).spacy init config -p trainable_lemmatizer
or using the quickstart.thinc
v8.0.14+ and thinc-bigendian-ops
.spacy debug diff-config
.SpanCategorizer.set_candidates
for debugging span suggesters.spancat
and trainable_lemmatizer
components.
v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
Package | Language | UPOS | Parser LAS | NER F |
---|---|---|---|---|
fi_core_news_sm | Finnish | 92.5 | 71.9 | 75.9 |
fi_core_news_md | Finnish | 95.9 | 78.6 | 80.6 |
fi_core_news_lg | Finnish | 96.2 | 79.4 | 82.4 |
ko_core_news_sm | Korean | 86.1 | 65.6 | 71.3 |
ko_core_news_md | Korean | 94.7 | 80.9 | 83.1 |
ko_core_news_lg | Korean | 94.7 | 81.3 | 85.3 |
sv_core_news_sm | Swedish | 95.0 | 75.9 | 74.7 |
sv_core_news_md | Swedish | 96.3 | 78.5 | 79.3 |
sv_core_news_lg | Swedish | 96.3 | 79.1 | 81.1 |
🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!
The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.
Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
---|---|---|
da_core_news_md | 84.9 | 94.8 |
de_core_news_md | 73.4 | 97.7 |
el_core_news_md | 56.5 | 88.9 |
fi_core_news_md | - | 86.2 |
it_core_news_md | 86.6 | 97.2 |
ko_core_news_md | - | 90.0 |
lt_core_news_md | 71.1 | 84.8 |
nb_core_news_md | 76.7 | 97.1 |
nl_core_news_md | 81.5 | 94.0 |
pl_core_news_md | 87.1 | 93.7 |
pt_core_news_md | 76.7 | 96.9 |
ro_core_news_md | 81.8 | 95.5 |
sv_core_news_md | - | 95.5 |
Scorer.score_cats
for missing labels._
value for UPOS in CoNLL-U converter.Span
attributes consistently."spans"
to the output of doc.to_json
.Matcher
handling for all special cases.Example
to align whitespace annotation.Tok2Vec
for empty batches.rehearse
.Vectors.n_keys
for floret vectors.meta
in util.load_model_from_config
.Example.get_matching_ents
.Tokenizer.explain
.KoreanTokenizer
tag map.init vectors
.Tagger
architecture, edit your configs to switch from spacy.Tagger.v1
to spacy.Tagger.v2
and then run init fill-config
.
Span comparisons (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
Doc.from_docs now includes Doc.tensor by default and supports excludes with an exclude argument in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data.
@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996
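The exclude argument of Doc.from_docs mentioned in this release can be illustrated with span groups. This is a minimal sketch, assuming spaCy v3.3 or later; the span group key "sc" and the example texts are made up for illustration.

```python
import spacy
from spacy.tokens import Doc

# Blank pipeline: tokenization only, no trained package required.
nlp = spacy.blank("en")
doc1 = nlp("First document.")
doc2 = nlp("Second document.")

# Store one span under a hypothetical span group key "sc".
doc1.spans["sc"] = [doc1[0:1]]

# By default, span groups are carried over into the merged Doc.
merged = Doc.from_docs([doc1, doc2])

# With exclude=["spans"], the merged Doc is built without span groups.
merged_without_spans = Doc.from_docs([doc1, doc2], exclude=["spans"])

print(len(merged), "sc" in merged.spans)
print("sc" in merged_without_spans.spans)
```

The same list format is used for the other supported fields, tensor and user_data.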
Published by adrianeboyd over 2 years ago
@adrianeboyd, @honnibal, @ines
Published by adrianeboyd over 2 years ago
@adrianeboyd, @honnibal, @ines
Published by adrianeboyd over 2 years ago
Tok2Vec
for empty batches.
@adrianeboyd, @honnibal, @ines
Published by adrianeboyd over 2 years ago
spancat
for empty docs and zero suggestions.Lexeme.rank
.Tok2Vec
for empty batches.
@adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz
Published by adrianeboyd over 2 years ago
Tok2Vec
for empty batches.
@adrianeboyd, @danieldk, @honnibal, @ines
Published by adrianeboyd over 2 years ago
parser
and ner
speeds on long documents (see technical details in #10019).spancat
components in debug data
.ENT_IOB
as a Matcher
token pattern key.ENT_IOB
.debug data
.Lexeme.rank
.spacy project
.Doc.from_docs()
for empty docs.debug data
for components with custom names.Underscore
and DependencyMatcher
and improve types in Language
, Matcher
and PhraseMatcher
.Tokenizer.explain
when infixes appear as prefixes.spancat
initialization.IS_SENT_END
in Doc.has_annotation
.spacy package
.PhraseMatcher
.Dockerfile
for repeatable website builds and easier local development.@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav
Published by adrianeboyd almost 3 years ago
doc_cleaner component for removing doc.tensor, doc._.trf_data or other Doc attributes at the end of the pipeline to reduce size of output docs.
ENT_ID
and ENT_KB_ID
to Matcher
pattern attributes.kb_id
for entities in displaCy from Doc
input.Span.sents
property for spans spanning over more than one sentence.EntityRuler.remove
to remove patterns by id
.Tagger
neg_prefix
configurable.Language.pipe
in Language.evaluate
for more efficient processing.JsonlCorpus
path optional again.spancat
for empty docs and zero suggestions..jsonl
paths in EntityRuler
.Scorer.score_spans
to handle predicted docs with missing annotation.parser
from reference parse rather than aligned example.tagger
and morphologizer
.init_tok2vec
after pretraining, batch contract for listeners.
eng-spacysentiment: Sentiment analysis for English.
@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar
Published by adrianeboyd almost 3 years ago
nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing.
overwrite
config settings for entity_linker
, morphologizer
, tagger
, sentencizer
and senter
.extend
config setting for morphologizer
for whether existing feature types are preserved.
spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
spacy-loggers
for additional loggers.sudachipy
are annotated as Token.morph
features.morph_micro_p/r/f
scores for morphological features from Scorer.score_morph_per_feat()
.LIKE_URL
attribute includes the tokenizer URL pattern.--n-save-epoch
option for spacy pretrain
.ja_core_news_trf
, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!tok2vec
feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.Token.pos
and Token.morph
.
For more details, see the New in v3.2 usage guide.
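The Doc input support for nlp() listed among the new features can be sketched as follows. This is a minimal example, assuming spaCy v3.2 or later; the extension name "source", its value and the token list are hypothetical and chosen only for illustration.

```python
import spacy
from spacy.tokens import Doc

# Blank pipeline plus the rule-based sentencizer, so nothing needs downloading.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Register a custom extension; "source" is a made-up name for this example.
if not Doc.has_extension("source"):
    Doc.set_extension("source", default=None)

# Build the Doc from pre-tokenized words, bypassing the pipeline's tokenizer.
words = ["Hello", "world", "!", "Second", "sentence", "."]
doc = Doc(nlp.vocab, words=words)
doc._.source = "example"

# Pass the existing Doc to the pipeline instead of a string.
processed = nlp(doc)

tokens = [token.text for token in processed]
n_sents = len(list(processed.sents))

print(tokens)
print(processed._.source, n_sents)
```

The custom tokenization and the extension value are kept, and the remaining components (here only the sentencizer) run on top of them. nlp.pipe() accepts an iterable of Doc objects in the same way.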
Language.pipe(as_tuples=True)
for multiprocessing with custom error handlers.Tokenizer
.
In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
ChineseTokenizer
, JapaneseTokenizer
, KoreanTokenizer
, ThaiTokenizer
and VietnameseTokenizer
require Vocab
rather than Language
in __init__
.DocBin
, user data is now always serialized according to the store_user_data
option, see #9190.
pipelines/floret_vectors_demo: basic floret vector training and importing.
pipelines/floret_fi_core_demo: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
pipelines/floret_ko_ud_demo: Korean UD vector and pipeline training, comparing standard vs. floret vectors.
@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker
Published by svlandeg almost 3 years ago
AppleOps
: pip install spacy[apple]
.spacy.models_with_nvtx_range.v1
.mypy
integration in the CI and many type fixes across the code base.Protocol
classes in ty.py
to define behavior of pipeline components.displacy
.spacy project assets
.train
function to run the training from Python scripts just like the spacy train
CLI.spacy-transformers>=1.1.0
with improved IO.thinc>=8.0.11
with improved gradient clipping.KnowledgeBase.set_entities
.DocBin
constructor.spacy project
title.DependencyMatcher
.textcat
and textcat_multilabel
configurations.Doc
object creation.convert
CLI..pyi
files in the distributed package.
deplacy: CUI-based dependency visualizer
ipymarkup: Visualizations for NER and syntax trees
PhruzzMatcher: Find fuzzy matches
spacy-huggingface-hub: Push spaCy pipelines to the Hugging Face Hub
spaCyOpenTapioca: Entity Linking on Wikidata
spacy-clausie: Clause-based information extraction system
@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker
Published by svlandeg about 3 years ago
v3 of WandbLogger now supports optional run_name and entity parameters.
pos
values for a Doc
or Token
.Matcher
callbacks.config
in create_pipe
.typer
0.4 to provide support for both Click 7 and Click 8.spacy project
workflows.repo
and path
arguments in spacy project
.epoch_resume
in spacy pretrain
.spacy-legacy
in spacy package
dependency detection.spacy package
.StringStore
and the Vocab
.@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker
Published by svlandeg about 3 years ago
SpanCategorizer
predictions..pyi
stub files.spacy package
.INTERSECTS
operator for the Matcher.spacy project
push
and pull
commands.Span.as_doc
calls.da
transformer is now the same as the one from the trained pipelines (Maltehb/danish-bert-botxo
).debug data
runs correctly with a custom tokenizer.ISSUBSET
and ISSUPERSET
in schema and docs.no_skip
value for spacy project run
.ConsoleLogger
flush after each logging line.exclude
when serializing the vocab.allow_overlap
default for span categorizer scoring._SP
.@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker
Published by adrianeboyd about 3 years ago
debug data
.load_lookups
return type and docstring.EntityLinker
robust for nO=None
.minn
is not set.debug model
for transformers.ENT_KB_ID
in ner
annotation.Matcher(as_spans)
on spans.Doc.from_docs()
for all empty docs.textcat
with listener.ENT_ID
and NORM
to DocBin
strings.Span.as_doc
.Span
attrs writable.debug data
for textcat
.DocBin
is too large.to/from_bytes
for KnowledgeBase
and EntityLinker
.Span.get_lca_matrix
.attrs.IDS
.spacy.batch_by_words.v1
.EntityRuler
: ent_ids
returns None
for phrases.EntityRuler
.pymorphy2
requirement to pymorphy2
mode in Russian and Ukrainian lemmatizers.Doc
.Span.lemma_
.JsonlReader
path optional.Example.from_dict
.Doc.from_docs
.textcat
with <2 labels.@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD
Published by svlandeg about 3 years ago
noun_chunk
iterator for Dutch.black
& flake8
as pre-commit hooks.spacy.ngram_range_suggester.v1
for suggesting a range of n-gram sizes for the spancat
component.ru
and uk
multiprocessing (with spawn
).meta
information with spacy package
.replace_pipe
takes disabled components into account.@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe