💫 Industrial-strength Natural Language Processing (NLP) in Python
MIT License
Published by svlandeg 4 months ago
- `spacy download` (#13313).
- `typing-extensions<5.0.0` for Python < 3.8 (#13516).
- `use_gold_ents` behaviour for `EntityLinker`.
- `MorphAnalysis` (#13433).

@danieldk, @honnibal, @ines, @JoeSchiff, @nokados, @Paillat-dev, @rmitsch, @schorfma, @strickvl, @svlandeg, @ynx0
Published by danieldk 8 months ago
- `TextCatReduce.v1` layer for text classification (#13181).
- `TextCatParametricAttention.v1` layer for text classification (#13201).
- `build` module for creating model packages by default (#13109).
- `benchmark speed` command (#13247).
- `Language.pipe`.
- `Doc`.
- `Tokenizer.explain` for special cases with whitespace.
- `SparseLinear` layer.
- `trf_data` examples and the transformer pipeline design section.

@adrianeboyd, @danieldk, @evornov, @honnibal, @ines, @lise-brinck, @ridge-kimani, @rmitsch, @shadeMe, @svlandeg
Published by adrianeboyd about 1 year ago
- `__all__` fields (#13063).
- `spacy.cli.project` API.
- `Any` comparisons for `Token` and `Span`.
- `spacy-llm` including Azure OpenAI, PaLM, and Mistral support.

@adrianeboyd, @honnibal, @ines, @rmitsch, @svlandeg
Published by adrianeboyd about 1 year ago
- `spacy.info` to fix availability of `spacy.cli` following `import spacy` (#13040).

@adrianeboyd, @honnibal, @ines, @svlandeg
Published by adrianeboyd about 1 year ago
This release drops support for Python 3.6 and adds support for Python 3.12.
- `spacy project` commands should run as before, just now they're using Weasel under the hood.
- `transformers` extra to `spacy-transformers` v1.3 (#13025).
- `--spans-key` option for CLI evaluation with `spacy benchmark accuracy` (#12981).
- `spacy.info` (#12962).
- `spacy.training.example` (#12801).
- `Language.replace_listeners`: Pass the replaced listener and the `tok2vec` pipe to the callback in order to support `spacy-curated-transformers` (#12785).
- `tqdm` with `disable=None` to disable output in non-interactive environments (#12979).

The transformer-based `trf` pipelines have been updated to use our new Curated Transformers library through the Thinc model wrappers and pipeline component from spaCy Curated Transformers.

- `ray` extra.
- `spacy project` has a few backwards incompatibilities due to the transition to the standalone library Weasel, which is not as tightly coupled to spaCy. Weasel produces warnings when it detects older spaCy-specific settings in your environment or project config.
- `spacy_version` configuration key has been dropped.
- `check_requirements` configuration key has been dropped due to the deprecation of `pkg_resources`.
- `SPACY_CONFIG_OVERRIDES` environment variable is no longer checked. You can set configuration overrides using `WEASEL_CONFIG_OVERRIDES`.
- `SPACY_PROJECT_USE_GIT_VERSION` environment variable has been dropped.

@adrianeboyd, @bdura, @connorbrinton, @danieldk, @davidberenstein1957, @denizcodeyaa, @eltociear, @evornov, @honnibal, @ines, @jmyerston, @koaning, @magdaaniol, @pdhall99, @ringohoffman, @rmitsch, @senisioi, @shadeMe, @svlandeg, @vinbo8, @wjbmattingly
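For scripts or CI jobs that still export `SPACY_CONFIG_OVERRIDES`, the rename described in this release could be bridged with a small shim along these lines. This is only a sketch: `migrate_overrides` is a hypothetical helper, not part of spaCy or Weasel, and it assumes the override string can be carried over unchanged, which the notes above do not confirm.

```python
import os
from typing import MutableMapping

OLD_VAR = "SPACY_CONFIG_OVERRIDES"   # no longer checked, per the notes above
NEW_VAR = "WEASEL_CONFIG_OVERRIDES"  # the variable the notes say to use instead


def migrate_overrides(env: MutableMapping[str, str]) -> bool:
    """Copy the old variable's value to the new one if only the old one is set.

    Hypothetical helper; assumes the value format is the same for both
    variables. Returns True when a value was copied.
    """
    if OLD_VAR in env and NEW_VAR not in env:
        env[NEW_VAR] = env[OLD_VAR]
        return True
    return False


# Apply to the current process environment before running project commands.
migrate_overrides(os.environ)
```

An explicitly set `WEASEL_CONFIG_OVERRIDES` is never overwritten, so the shim is safe to leave in place after the environment has been updated.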
Published by adrianeboyd about 1 year ago
- `find-function` CLI for finding locations of registered functions (#12757).
- `spacy[cuda12x]` for `cupy-cuda12x` (#12890).
- `init config` and `train` CLI (#12173).
- `distutils` to `setuptools`/`sysconfig` (#12853).
- `<br>` tags in displaCy.

@adrianeboyd, @afriedman412, @arplusman, @bdura, @connorbrinton, @honnibal, @ines, @it176131, @pmbaumgartner, @rmitsch, @shadeMe, @svlandeg, @thomashacker, @victorialslocum, @x-tabdeveloping
Published by adrianeboyd over 1 year ago
- `span_finder` pipeline component to identify overlapping, unlabeled spans (#12507).
- `spacy evaluate --per-component`, `Language.evaluate(per_component=True)` and `Scorer.score(per_component=True)` (#12540).
- `spancat_singlelabel` in `spacy debug data` CLI (#12749).
- `PhraseMatcher` and `SpanGroup` (#12642, #12714).
- `SpanGroup` spans come from the current doc.

We have added new pipelines for Slovenian that use the trainable lemmatizer and floret vectors.

Package | UPOS | Parser LAS | NER F |
---|---|---|---|
`sl_core_news_sm` | 96.9 | 82.1 | 62.9 |
`sl_core_news_md` | 97.6 | 84.3 | 73.5 |
`sl_core_news_lg` | 97.7 | 84.3 | 79.0 |
`sl_core_news_trf` | 99.0 | 91.7 | 90.0 |

The English pipelines have been updated to improve handling of contractions with various apostrophes and to lemmatize "get" as a passive auxiliary.
The Danish pipeline `da_core_news_trf` has been updated to use `vesteinn/DanskBERT` with performance improvements across the board.

`SpanGroup` spans are now required to be from the same doc. When initializing a `SpanGroup`, there is a new check to verify that all added spans refer to the current doc. Without this check, it was possible to run into string store or other errors.

@adrianeboyd, @bdura, @danieldk, @davidberenstein1957, @diyclassics, @essenmitsosse, @honnibal, @ines, @isabelizimm, @jmyerston, @kadarakos, @KennethEnevoldsen, @khursani8, @ljvmiranda921, @rmitsch, @shadeMe, @svlandeg, @tomaarsen, @victorialslocum, @vin-ivar, @ZiadAmerr
Published by adrianeboyd over 1 year ago
@adrianeboyd, @bdura, @honnibal, @ines, @svlandeg
Published by adrianeboyd over 1 year ago
This bug fix release is primarily to address Pydantic incompatibility with `typing_extensions>=4.6.0`.

- `spancat`, in particular on GPU (~10x-30x faster) (#12577).
- `typing_extensions` requirement due to Pydantic incompatibility with `typing_extensions>=4.6.0`.
- `#egg` from download URLs due to future deprecation in `pip`.

@adrianeboyd, @honnibal, @ines, @kadarakos, @svlandeg
Published by adrianeboyd over 1 year ago
This bug fix release is primarily to address Pydantic incompatibility with `typing_extensions>=4.6.0`.

- `spancat`, in particular on GPU (~10x-30x faster) (#12577).
- `typing_extensions` requirement due to Pydantic incompatibility with `typing_extensions>=4.6.0`.
- `#egg` from download URLs due to future deprecation in `pip`.

@adrianeboyd, @honnibal, @ines, @kadarakos, @svlandeg
Published by adrianeboyd over 1 year ago
- `spancat`, in particular on GPU (~10x-30x faster) (#12577).
- (`>+`, `>-`, `>++`, `>--`) for the dependency matcher (#12528).
- `doc.spans` for displaCy output in `spacy benchmark accuracy` / `spacy evaluate` (#12575).
- `MorphAnalysis.get(default=)` argument for user-provided default values similar to `dict` (#12545).
- `#egg` from download URLs due to future deprecation in `pip`.

@adrianeboyd, @andyjessen, @bdura, @davidberenstein1957, @diyclassics, @honnibal, @ines, @kadarakos, @KennethEnevoldsen, @ljvmiranda921, @moxley01, @royashcenazi, @svlandeg, @tanloong, @victorialslocum
Published by adrianeboyd over 1 year ago
- `spacy pretrain` (#12435).
- `model-last.bin` for `spacy pretrain` (#12459).
- `Span` input for `displacy.parse_deps` (#12477).
- `cupy` install extras.
- `Span.sents`.
- `spancat_singlelabel`.
- `Span.sents` when the final sentence is the last token in a `Doc`.
- `Span.kb_id` and `Span.id` strings in `Doc` and `DocBin` serialization.

@adrianeboyd, @BLKSerene, @honnibal, @ines, @kadarakos, @prajakta-1527, @rmitsch, @shadeMe, @sloev, @svlandeg, @thomashacker, @willfrey
Published by adrianeboyd over 1 year ago
💥 We'd love to hear more about your experience with spaCy! Take our survey here.
- `spancat_singlelabel` pipeline component for multi-class and non-overlapping span classification. The `spancat_singlelabel` component predicts at most one label for each suggested span and adds a new setting `allow_overlap` to restrict the output to non-overlapping spans (#11365).
- `transformer` + CNN for efficient GPU `textcat` with `spacy init config` (#11900).
- `spacy debug data` (#11419).
- (`>+`, `>-`, `<+`, `<-`) (#12334).
- `spacy.PlainTextCorpusReader.v1` for plain text input (#12122).
- `alignment_mode` and `span_id` to `Span.char_span()` (#12145, #12196).
- `top_k>1` in trainable lemmatizer.
- `test_cli_find_threshold()` test more robust.
- `registry.find()`.
- `Matcher` patterns with extension attributes.
- `grc` to languages with lexeme norms in `spacy-lookups-data`.
- `KnowledgeBase` instances configurable.
- `auto_select_port`.
- `InMemoryLookupKB.is_empty`.
- `Lexeme.orth` and `Lexeme.lower`.
- `PretrainVectors`.
- `pkg_resources`.

@adrianeboyd, @andyjessen, @danieldk, @essenmitsosse, @honnibal, @ines, @itssimon, @kadarakos, @kwhumphreys, @ljvmiranda921, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @shadeMe, @svlandeg, @tanloong, @thomashacker, @victorialslocum
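To make the `spancat_singlelabel` description in this release more concrete, here is a toy sketch of "at most one label per suggested span" combined with an `allow_overlap`-style filter. It is not spaCy's implementation: the function name, the input shape, and the greedy highest-score-first rule for resolving overlaps are all assumptions made for illustration, and the sketch ignores the case where a span receives no label at all.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]                   # (start, end) token offsets, end exclusive
Labeled = Tuple[int, int, str, float]    # (start, end, label, score)


def single_label_spans(
    scored: Dict[Span, Dict[str, float]], allow_overlap: bool = True
) -> List[Labeled]:
    """Pick the best label per span; optionally drop overlapping spans."""
    best: List[Labeled] = []
    for (start, end), label_scores in scored.items():
        # Single-label: keep only the highest-scoring label for this span.
        label, score = max(label_scores.items(), key=lambda kv: kv[1])
        best.append((start, end, label, score))
    if allow_overlap:
        return sorted(best)
    kept: List[Labeled] = []
    # Assumed strategy: visit spans from highest to lowest score and keep
    # a span only if it does not overlap anything already kept.
    for cand in sorted(best, key=lambda s: s[3], reverse=True):
        if all(cand[1] <= k[0] or cand[0] >= k[1] for k in kept):
            kept.append(cand)
    return sorted(kept)


suggestions = {
    (0, 2): {"PER": 0.9, "ORG": 0.1},
    (1, 3): {"PER": 0.2, "ORG": 0.6},
    (4, 5): {"LOC": 0.7},
}
print(single_label_spans(suggestions, allow_overlap=False))
```

With `allow_overlap=True` all three spans are returned with one label each; with `allow_overlap=False` the lower-scoring span `(1, 3)` is dropped because it overlaps `(0, 2)`.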
Published by adrianeboyd over 1 year ago
- `apply` CLI command to annotate new documents with a trained pipeline (#11376).
- `benchmark` CLI command to benchmark pipelines. The new `benchmark speed` subcommand measures the speed of a pipeline, and the `benchmark accuracy` subcommand is a new alias for `evaluate` (#11902).
- `find-threshold` CLI command to identify an optimal threshold for classification models (#11280).
- `FUZZY` `Matcher` operator for fuzzy matches based on Levenshtein edit distance. In addition, the `FUZZY` and `REGEX` operators are now supported in combination with `IN`/`NOT_IN` (#11359).
- `typer` v0.7.x (#11720), `mypy` 0.990 (#11801) and `typing_extensions` v4.4.x (#12036).
- `spacy.ConsoleLogger.v3` with expanded progress tracking (#11972).
- `textcat` with `spacy.textcat_scorer.v2` (#11696 and #11971) and `spacy.textcat_multilabel_scorer.v2` (#11820).
- `InMemoryLookupKB` (#11268).
- `before_update` callback that is invoked at the start of each training step (#11739).
- `SpanGroup` (#11380).
- `displacy.serve` when the default port is in use (#11948).
- `tok2vec` version (#11618).
- `tok2vec` or `transformer` layer.
- `textcat`.
- `Vocab.to_disk` respects the exclude setting for `lookups` and `vectors`.
- `SpanGroup` and `Span` objects.

The following changes may require you to update code that is using the relevant functionality:

- `textcat` or `textcat_multilabel` model - ensure that values are 0.0 or 1.0 as explained in the docs.
- `KnowledgeBase` is now an abstract class. Call the constructor of the new `InMemoryLookupKB` instead when you want to use spaCy's default KB implementation. If you've written a custom KB that inherits from `KnowledgeBase`, you'll need to implement its abstract methods, or alternatively inherit from `InMemoryLookupKB` instead.

The following changes may influence the output of your language pipeline or trained models:

- `pymorphy3` (#11345, #11811).
- `tok2vec` defaults in all components (#11618).
- `textcat` and `textcat_multilabel` components (#11698).
- `textcat` and `textcat_multilabel` to fix a bug related to `threshold` for `textcat` and to make it possible to score multiple `textcat`/`textcat_multilabel` components in a single pipeline with custom scorers. If no custom scorers are used, the `cat_p/r/f` scores will now only reflect the final component's labels and performance (#11696, #11820).
- `token_acc` score to report the intended measure (`# correct tokens / # predicted tokens`, the same as in spaCy v2). The `token_acc` scores for v3.5 will be lower for the same performance because they were incorrectly inflated in v3.0-v3.4. The `token_p/r/f` scores should remain unchanged (#12073).

The following functionality will be changed in the near future - so it's best to start updating your scripts now to make them more generic:

- `master` branch to `main`.
- `IS_SPACE` as a `tok2vec` feature for `tagger` and `morphologizer` components to improve tagging of non-whitespace vs. whitespace tokens.
- `spacy-transformers` v1.2, which uses the exact alignment from `tokenizers` for fast tokenizers instead of the heuristic alignment from `spacy-alignments`. For all trained pipelines except `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens may be slightly different. More details about the `spacy-transformers` changes in the v1.2.0 release notes.
- `biluo_to_iob` and `iob_to_biluo` functions.

@aaronzipp, @adrianeboyd, @albertvillanova, @ArchiDevil, @cfuerbachersparks, @damian-romero, @danieldk, @darigovresearch, @DSLituiev, @essenmitsosse, @gremur, @honnibal, @ines, @jmyerston, @JosPolfliet, @kadarakos, @koaning, @kwhumphreys, @ljvmiranda921, @MarcoGorelli, @orglce, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @ryndaniels, @shadeMe, @svlandeg, @thomashacker, @TrellixVulnTeam, @wannaphong, @zhiiw, @zrpxx
Published by adrianeboyd almost 2 years ago
This release addresses future compatibility with NumPy v1.24+.
@adrianeboyd, @honnibal, @ines, @svlandeg
Published by adrianeboyd almost 2 years ago
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
- `smart_open` requirement and update deprecated options.
- `spacy init config --gpu` for environments without `spacy-transformers`.

@adrianeboyd, @honnibal, @ines, @polm, @svlandeg
Published by adrianeboyd almost 2 years ago
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
- `spancat` for docs with zero suggestions.
- `smart_open` requirement and update deprecated options.
- `spacy init config --gpu` for environments without `spacy-transformers`.

@adrianeboyd, @honnibal, @ines, @polm, @svlandeg
Published by adrianeboyd almost 2 years ago
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
- `spancat` for docs with zero suggestions.
- `smart_open` requirement and update deprecated options.
- `spacy init config --gpu` for environments without `spacy-transformers`.

@adrianeboyd, @honnibal, @ines, @polm, @svlandeg
Published by adrianeboyd almost 2 years ago
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
- `precomputable_biaffine` by avoiding concatenation.
- `spancat` for docs with zero suggestions.
- `smart_open` requirement and update deprecated options.
- `spacy init config --gpu` for environments without `spacy-transformers`.
- `EditTreeLemmatizer`.

@adrianeboyd, @danieldk, @honnibal, @ines, @polm, @svlandeg
Published by adrianeboyd almost 2 years ago
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
- `spancat` for docs with zero suggestions.
- `smart_open` requirement and update deprecated options.
- `spacy init config --gpu` for environments without `spacy-transformers`.
- `EditTreeLemmatizer`.

@adrianeboyd, @danieldk, @honnibal, @ines, @polm, @svlandeg