spaCy | Python Ecosystem Directory

spaCy - v3.4.3: Extended Typer support and bug fixes

Published by adrianeboyd almost 2 years ago

✨ New features and improvements

Extend Typer support to v0.7.x (#11720).

🔴 Bug fixes

#11640: Handle docs with no entities in EntityLinker.
#11688: Restore custom doc extension values in Doc.to_json() for attributes set by getters.
#11706: Remove incorrect warning for pipeline_package.load().
#11735: Improve spacy project requirements checks for unsupported specifiers and requirements lines.
#11745: Revert modifications to spacy.load(disable=) that could enable currently disabled components.

👥 Contributors

@aaronzipp, @adrianeboyd, @honnibal, @ines, @polm, @rmitsch, @ryndaniels, @svlandeg, @thomashacker

spaCy - v3.4.2: Latin and Luganda support, Python 3.11 wheels and more

Published by adrianeboyd almost 2 years ago

✨ New features and improvements

NEW: Luganda language support (#10847).
NEW: Latin language support (#11349).
NEW: spacy.ConsoleLogger.v2 optionally saves training logs to JSONL (#11214).
NEW: New operators for the DependencyMatcher to include matching parents or children to the left or the right of the node (#10371).
Prebuilt Python 3.11 wheels are now available for all spaCy dependencies distributed by @explosion.
Support pydantic v1.10 and mypy 0.980+, drop mypy support for Python 3.6 (#11546, #11635).
Support CuPy v11 and add extras for cuda11x and cuda-autodetect (using cupy-wheel) (#11279).
Support custom attributes for tokens and spans in Doc.to_json() and Doc.from_json() (#11125).
Make the enable and disable options for spacy.load() more consistent (#11459).
Allow a single string argument for disable/enable/exclude for spacy.load() (#11406).
New --url flag for spacy info to print the direct download URL for a pipeline (#11175).
Add a check for missing requirements in the spacy project CLI (#11226).
Add a Levenshtein distance function (#11418).
Improvements to the spacy debug data CLI for spancat data (#11504).
Allow overriding spacy_version in spacy package metadata (#11552).
Improve the error message when using the wrong command for spacy project assets (#11458).
Ensure parent directories are created when storing the results of the spacy pretrain command (#11210).
Extend support to newer versions of natto-py for the ko extra (#11222).
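
For readers unfamiliar with the metric behind the Levenshtein distance entry above, here is a minimal pure-Python sketch of the classic dynamic-programming algorithm. It is an illustration only, not spaCy's implementation; the function name `levenshtein` is chosen for this example and is not taken from the release.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Only the previous row of the dynamic-programming table is kept.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Keeping a single row holds memory at O(len(b)) while the running time stays O(len(a) * len(b)).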

📦 Trained pipelines updates

This release includes updated English pipelines for spaCy v3.4 with improved NER performance. The updates in en_core_web_* v3.4.1 address issues related to training from data with partial named entity annotation, which led to lower NER recall in English pipeline versions v3.0.0–v3.4.0. In particular, entities that appear in the sections of the OntoNotes training data without NER annotation were not predicted consistently by the earlier pipeline versions, such as names and places that are frequent in the Biblical sections, e.g., "David" and "Egypt" (see #7493).

Use spacy download to update your English pipelines to the newest version. If you'd prefer to keep using an earlier version, you can specify the version directly with e.g. spacy download -d en_core_web_sm-3.4.0. You can check that you are using the new version (v3.4.1) with spacy validate:

NAME                     SPACY            VERSION
en_core_web_md           >=3.4.0,<3.5.0   3.4.1     ✔

🔴 Bug fixes

#11275: Fix Dutch noun chunks to skip overlapping spans.
#11276: Fix regex invalid escape sequences.
#11312: Better handling of unexpected types in SetPredicate.
#11460: Fix config validation failures caused by NVTX pipeline wrappers.
#11506: Avoid unwanted side effects in Doc.__init__.
#11540: Preserve missing entity annotation in augmenters.
#11592: Fix issues with DVC commands.
#11631: Fix initialization for pymorphy2_lookup lemmatizer mode for Russian and Ukrainian.

⚠️ Backwards incompatibilities

If you're using a custom component that does not return a Doc type, an error will now be raised (#11424).
If you're using a dot in a factory name, an error is raised as this is not supported (#11336).
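
The two checks above can be pictured with a small stand-alone sketch. Everything here is hypothetical: `ToyDoc`, `check_factory_name` and `run_pipeline` are illustration names, and the exception types are assumptions rather than the errors spaCy actually raises.

```python
from typing import Callable, List


class ToyDoc:
    """Stand-in for a document object; not spaCy's Doc class."""

    def __init__(self, text: str) -> None:
        self.text = text


def check_factory_name(name: str) -> str:
    # Reject names containing a dot, mirroring the rule described above.
    if "." in name:
        raise ValueError(f"Factory name {name!r} must not contain a dot.")
    return name


def run_pipeline(doc: ToyDoc, components: List[Callable[[ToyDoc], object]]) -> ToyDoc:
    for component in components:
        result = component(doc)
        # A component that forgets to return the doc yields None here.
        if not isinstance(result, ToyDoc):
            raise TypeError(
                f"Component {component.__name__} returned "
                f"{type(result).__name__}, expected ToyDoc."
            )
        doc = result
    return doc


def uppercase(doc: ToyDoc) -> ToyDoc:
    doc.text = doc.text.upper()
    return doc


def forgets_return(doc: ToyDoc) -> None:
    # Modifies the doc in place but never returns it.
    doc.text = doc.text.strip()
```

The most common way to trip the first check is a component that modifies the doc in place and omits the final `return doc`, as `forgets_return` does.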

📖 Documentation and examples

Added documentation for the new experimental coref component.
Added Ukrainian trained pipelines to the website.
Added documentation for the spacy.models_and_pipes_with_nvtx_range.v1 callback.
Fix English pipeline names in v3.4 release notes.
Various fixes to the Example API documentation.
Extensions and improvements to the displacy docs.
Fix the example command for spacy project dvc.
Update example code for spacy-wordnet.
Improve API documentation around the initialize() function for pipeline components.
Fix various typos and inconsistencies.
spaCy universe additions:
- concepCy: A spaCy wrapper for ConceptNet.
- spaCy partial tagger: build a CRF tagger with a partially annotated dataset.
- Zshot: Zero and Few shot named entity & relationships recognition.

👥 Contributors

@adrianeboyd, @bdura, @danieldk, @diyclassics, @DSLituiev, @GabrielePicco, @honnibal, @ines, @JulesBelveze, @kadarakos, @ljvmiranda921, @ninjalu, @pmbaumgartner, @polm, @radandreicristian, @richardpaulhudson, @rmitsch, @shadeMe, @stefawolf, @svlandeg, @thomashacker, @tobiusaolo, @tzussman, @yasufumy

spaCy - v2.3.8: Updates for Python 3.10 and 3.11

Published by adrianeboyd almost 2 years ago

✨ New features and improvements

Updates and binary wheels for Python 3.10 and 3.11.

👥 Contributors

@adrianeboyd, @honnibal, @ines

spaCy - v3.4.1: Fix compatibility with CuPy v9.x

Published by adrianeboyd about 2 years ago

🔴 Bug fixes

Fix issue #11137: Fix compatibility with CuPy v9.x.

📖 Documentation and examples

spaCy universe additions:
- BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.
- English Interpretation Sentence Pattern: English interpretation for accurate translation from English to Japanese.

👥 Contributors

@adrianeboyd, @danieldk, @honnibal, @ines, @lll-lll-lll-lll, @Lucaterre, @MaartenGr, @mr-bjerre, @polm, @radenkovic

spaCy - v3.4.0: Updated types, speed improvements and pipelines for Croatian

Published by adrianeboyd over 2 years ago

✨ New features and improvements

Support for mypy 0.950+ and pydantic v1.9 (#10786).
Prebuilt linux aarch64 wheels are now available for all spaCy dependencies distributed by @explosion.
Min/max {n,m} operator for Matcher patterns (#10981).
Language updates:
- Improve tokenization for Cyrillic combining diacritics (#10837).
- Improve English tokenizer exceptions for contractions with this/that/these/those (#10873).
Improved speed of vector lookups (#10992).
For the parser, use C saxpy/sgemm provided by the Ops implementation in order to use Accelerate through thinc-apple-ops (#10773).
Improved speed of Example.get_aligned_parse and Example.get_aligned (#10952).
Improved speed of StringStore lookups (#10938).
Updated spacy project clone to try both main and master branches by default (#10843).
Added confidence threshold for named entity linker (#11016).
Improved handling of Typer optional default values for init_config_cli (#10788).
Added cycle detection in parser projectivization methods (#10877).
Added counts for NER labels in debug data (#10960).
Support for adding NVTX ranges to TrainablePipe components (#10965).
Support env variable SPACY_NUM_BUILD_JOBS to specify the number of build jobs to run in parallel with pip (#11073).
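
To make the cycle-detection entry above concrete, here is a minimal sketch of how a cycle in a dependency head array can be detected. It assumes a simple encoding chosen for this example (`heads[i]` is the index of token i's head, and a root points to itself); spaCy's internal representation and algorithm may differ.

```python
from typing import List


def has_cycle(heads: List[int]) -> bool:
    """Return True if following head pointers from some token never reaches a root.

    Assumption for this sketch: heads[i] is the index of token i's head,
    and a root token points to itself.
    """
    for start in range(len(heads)):
        seen = {start}
        current = start
        while heads[current] != current:  # stop once a root is reached
            current = heads[current]
            if current in seen:
                return True
            seen.add(current)
    return False


print(has_cycle([1, 1, 1]))  # False: tokens 0 and 2 attach to root 1
print(has_cycle([1, 2, 0]))  # True: 0 -> 1 -> 2 -> 0
```

A check like this matters because tree-manipulating routines typically walk head pointers upward and would loop forever on malformed input.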

📦 Trained pipelines updates

We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.

Package	UPOS	Parser LAS	NER F
`hr_core_news_sm`	96.6	77.5	76.1
`hr_core_news_md`	97.3	80.1	81.8
`hr_core_news_lg`	97.5	80.4	83.0

🙏 Special thanks to @gtoffoli for help with the new pipelines!

The English pipelines have new word vectors:

Package	Model Version	TAG	Parser LAS	NER F
`en_core_web_md`	v3.3.0	97.3	90.1	84.6
`en_core_web_md`	v3.4.0	97.2	90.3	85.5
`en_core_web_lg`	v3.3.0	97.4	90.1	85.3
`en_core_web_lg`	v3.4.0	97.3	90.2	85.6

All CNN pipelines have been extended to add whitespace augmentation.

🔴 Bug fixes

Fix issue #10960: Support hyphens in NER labels.
Fix issue #10994: Fix horizontal spacing for spans in displaCy.
Fix issue #11013: Check for any token with a vector in Doc.has_vector, distinguish 0-vectors and missing vectors in similarity warnings.
Fix issue #11056: Don't use get_array_module in textcat.
Fix issue #11092: Fix vertical alignment for spans in displaCy.

🚀 Notes about upgrading from v3.3

Doc.has_vector now matches Token.has_vector and Span.has_vector: it returns True if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.
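
The difference between the two behaviours can be shown with a toy model. The classes and function names below are hypothetical stand-ins written for this note, not spaCy code; they only mirror the rule as described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyToken:
    text: str
    has_vector: bool


def doc_has_vector_old(vocab_vector_count: int) -> bool:
    # Earlier behaviour as described above: only the vocab is consulted.
    return vocab_vector_count > 0


def doc_has_vector_new(tokens: List[ToyToken]) -> bool:
    # v3.4 behaviour as described above: at least one token needs a vector.
    return any(token.has_vector for token in tokens)


tokens = [ToyToken("qwzx", False), ToyToken("vbnm", False)]
print(doc_has_vector_old(vocab_vector_count=20000))  # True
print(doc_has_vector_new(tokens))  # False
```

In practice this means code that used `Doc.has_vector` as a proxy for "the pipeline ships vectors" may now see False for documents made up entirely of tokens without vectors.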

📖 Documentation and examples

spaCy universe additions:
- Aim-spacy: An Aim-based spaCy experiment tracker.
- Asent: Fast, flexible and transparent sentiment analysis.
- spaCy fishing: Named entity disambiguation and linking on Wikidata in spaCy with Entity-Fishing.
- spacy-report: Generates interactive reports for spaCy models.

👥 Contributors

@adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere

spaCy - v3.3.1: New Span Ruler component, JSON (de)serialization of Doc, span analyzer and more

Published by danieldk over 2 years ago

✨ New features and improvements

Add the SpanRuler component. This component saves a list of matched spans to Doc.spans[spans_key].
Support for JSON serialization and deserialization of Doc objects.
Add span analysis to debug data.
Allow data assets to be made optional in a spaCy project.
Prebuilt macOS ARM64 wheels are now available for all spaCy dependencies distributed by @explosion.

🔴 Bug fixes

Fix issue #9575: Fix Entity Linker with tokenization mismatches between gold and predicted Doc objects.
Fix issue #10685: Fix serialization of SpanGroup objects that share the same name within one SpanGroups container.
Fix issue #10718: Remove debug print statements in walk_head_nodes to avoid acquiring the GIL.
Fix issue #10741: Make the StringStore.__getitem__ return type dependent on its parameter type.
Fix issue #10734: Support removal of overlapping terms in PhraseMatcher.
Fix issue #10772: Override SpanGroups.setdefault to also support Iterable[SpanGroup] as the default.
Fix issue #10817: Ensure that the term ROOT is in the glossary.
Fix issue #10830: Better errors for Doc.has_annotation and Matcher.
Fix issue #10864: Avoid pickling Doc inputs passed to Language.pipe().
Fix issue #10898: Fix schemas import in Doc.
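
The `StringStore.__getitem__` typing fix listed above (#10741) relies on a general Python technique: `typing.overload` lets a type checker infer a different return type per argument type. The sketch below uses a hypothetical `ToyStringStore` with sequential integer IDs, which is an assumption of this example and not how spaCy's StringStore assigns keys.

```python
from typing import Dict, Union, overload


class ToyStringStore:
    """Hypothetical two-way string/ID table, used only to illustrate typing."""

    def __init__(self) -> None:
        self._by_id: Dict[int, str] = {}
        self._by_text: Dict[str, int] = {}

    def add(self, text: str) -> int:
        # Reuse the existing ID when the string is already stored.
        if text not in self._by_text:
            key = len(self._by_text) + 1
            self._by_text[text] = key
            self._by_id[key] = text
        return self._by_text[text]

    @overload
    def __getitem__(self, key: str) -> int: ...

    @overload
    def __getitem__(self, key: int) -> str: ...

    def __getitem__(self, key: Union[str, int]) -> Union[int, str]:
        # Strings map to IDs, IDs map back to strings.
        if isinstance(key, str):
            return self._by_text[key]
        return self._by_id[key]


store = ToyStringStore()
apple_id = store.add("apple")
print(store["apple"] == apple_id, store[apple_id])  # True apple
```

With the overloads in place, a checker such as mypy can treat `store["apple"]` as `int` and `store[apple_id]` as `str` without casts.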

⚠️ Backwards incompatibilities

Before this release, a validation bug allowed the configuration of a pipeline component to override the name of the pipeline itself through the name attribute. For example, the following pipeline component:
```
[components.transformer]
factory = "transformer"
name = "custom_transformer_name"
```
would be registered erroneously as custom_transformer_name. Such overrides are now ignored and a warning is emitted (#10779). From spaCy v3.3.1 onwards, this component will be registered as transformer.

👥 Contributors

@adrianeboyd, @danieldk, @freddyheppell, @honnibal, @ines, @kadarakos, @ldorigo, @ljvmiranda921, @maxTarlov, @pmbaumgartner, @polm, @pypae, @richardpaulhudson, @rmitsch, @shadeMe, @single-fingal, @svlandeg

spaCy - v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish

Published by adrianeboyd over 2 years ago

✨ New features and improvements

Improved speeds for many components, see speed benchmarks for trained pipelines:
- Speed up parser and NER by using constant-time head lookups (#10048).
- Support unnormalized softmax probabilities in spacy.Tagger.v2 to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).
- Speed up parser projectivization functions (#10241).
- Replace Ragged with faster AlignmentArray in Example for training (#10319).
- Improve Matcher speed (#10659).
- Improve serialization speed for empty Doc.spans (#10250).
NEW: A trainable lemmatizer component that uses edit trees to transform tokens to lemmas. Add it to your config with spacy init config -p trainable_lemmatizer or using the quickstart.
Language updates:
- Initial support for Lower Sorbian and Upper Sorbian.
- New noun chunks for Finnish.
- Updated noun chunks for French, Italian and Spanish.
- Additional updates for English, French, Italian, Japanese, Korean, Norwegian, Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
Big endian support with thinc v8.0.14+ and thinc-bigendian-ops.
Config comparisons with spacy debug diff-config.
displaCy support for overlapping span annotation and multiple labeled arcs between the same tokens.
SpanCategorizer.set_candidates for debugging span suggesters.
The quickstart now supports adding spancat and trainable_lemmatizer components.
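
One way to see why unnormalized scores can be enough at prediction time, as in the spacy.Tagger.v2 entry above: softmax preserves the ordering of its inputs, so the highest-scoring label is the same before and after normalization. The sketch below demonstrates that property in plain Python; it does not reproduce the tagger's actual code, and the helper names are chosen for this example.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    # Index of the largest value (first one wins on ties).
    return max(range(len(values)), key=values.__getitem__)


logits = [1.2, -0.3, 3.4, 0.0]
print(argmax(logits), argmax(softmax(logits)))  # 2 2
```

Skipping the exponentiation and division saves work per token when only the best label is needed, not calibrated probabilities.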

📦 Trained pipelines

v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

Package	Language	UPOS	Parser LAS	NER F
`fi_core_news_sm`	Finnish	92.5	71.9	75.9
`fi_core_news_md`	Finnish	95.9	78.6	80.6
`fi_core_news_lg`	Finnish	96.2	79.4	82.4
`ko_core_news_sm`	Korean	86.1	65.6	71.3
`ko_core_news_md`	Korean	94.7	80.9	83.1
`ko_core_news_lg`	Korean	94.7	81.3	85.3
`sv_core_news_sm`	Swedish	95.0	75.9	74.7
`sv_core_news_md`	Swedish	96.3	78.5	79.3
`sv_core_news_lg`	Swedish	96.3	79.1	81.1

🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!

The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.

Model	v3.2 Lemma Acc	v3.3 Lemma Acc
`da_core_news_md`	84.9	94.8
`de_core_news_md`	73.4	97.7
`el_core_news_md`	56.5	88.9
`fi_core_news_md`	-	86.2
`it_core_news_md`	86.6	97.2
`ko_core_news_md`	-	90.0
`lt_core_news_md`	71.1	84.8
`nb_core_news_md`	76.7	97.1
`nl_core_news_md`	81.5	94.0
`pl_core_news_md`	87.1	93.7
`pt_core_news_md`	76.7	96.9
`ro_core_news_md`	81.8	95.5
`sv_core_news_md`	-	95.5

🔴 Bug fixes

Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
Fix issue #9443: Fix Scorer.score_cats for missing labels.
Fix issue #9669: Fix entity linker batching.
Fix issue #9903: Handle _ value for UPOS in CoNLL-U converter.
Fix issue #9904: Fix textcat loss scaling.
Fix issue #9956: Compare all Span attributes consistently.
Fix issue #10073: Add "spans" to the output of doc.to_json.
Fix issue #10086: Add tokenizer option to allow Matcher handling for all special cases.
Fix issue #10189: Allow Example to align whitespace annotation.
Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
Fix issue #10324: Fix Tok2Vec for empty batches.
Fix issue #10347: Update basic functionality for rehearse.
Fix issue #10394: Fix Vectors.n_keys for floret vectors.
Fix issue #10400: Use meta in util.load_model_from_config.
Fix issue #10451: Fix Example.get_matching_ents.
Fix issue #10460: Fix initial special cases for Tokenizer.explain.
Fix issue #10521: Stream large assets on download in spaCy projects.
Fix issue #10536: Handle unknown tags in KoreanTokenizer tag map.
Fix issue #10551: Add automatic vector deduplication for init vectors.

🚀 Notes about upgrading from v3.2

To see the speed improvements for the Tagger architecture, edit your configs to switch from spacy.Tagger.v1 to spacy.Tagger.v2 and then run init fill-config.
Span comparisons involving ordering (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
Doc.from_docs now includes Doc.tensor by default and supports excluding fields through an exclude argument in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data.
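
The note on span ordering above can be pictured as tuple comparison over all four attributes. The sketch uses a hypothetical `ToySpan` with string labels; how spaCy breaks ties internally (for example by comparing integer IDs instead of strings) may differ, so treat the exact order as an assumption of this example.

```python
from typing import List, NamedTuple


class ToySpan(NamedTuple):
    # Field order defines the comparison order for NamedTuple instances.
    start: int
    end: int
    label: str
    kb_id: str


spans: List[ToySpan] = [
    ToySpan(0, 2, "ORG", "Q2"),
    ToySpan(0, 2, "GPE", "Q1"),
    ToySpan(0, 1, "ORG", ""),
]

ordered = sorted(spans)
print([(s.start, s.end, s.label) for s in ordered])
# [(0, 1, 'ORG'), (0, 2, 'GPE'), (0, 2, 'ORG')]
```

Two spans with identical offsets but different labels previously could compare as equal in ordering terms; once label and KB ID take part in the comparison, their relative position in a sorted list becomes deterministic but may change from earlier versions.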

📖 Documentation and examples

spaCy universe additions:
- classy-classification: A Python library for classy few-shot and zero-shot classification within spaCy.
- Concise Concepts: Concise Concepts uses few-shot NER based on word embedding similarity.
- Crosslingual Coreference: Crosslingual coreference with an English coreference model plus crosslingual embeddings.
- EDS-NLP: spaCy components to extract information from clinical notes written in French.
- HuSpaCy: Industrial-strength Hungarian natural language processing.
- Klayers: spaCy as an AWS Lambda Layer.
- Named Entity Recognition (NER) using spaCy (video).
- Scrubadub: Remove personally identifiable information from text using spaCy.
- spacy-setfit-textcat: Experiments with SetFit & Few-Shot Classification.
- tmtoolkit: Text mining and topic modeling toolkit.

👥 Contributors

@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996

spaCy - v3.1.6: Workaround for Click/Typer issues

Published by adrianeboyd over 2 years ago

🔴 Bug fixes

Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.

👥 Contributors

@adrianeboyd, @honnibal, @ines

spaCy - v3.2.4: Workaround for Click/Typer issues

Published by adrianeboyd over 2 years ago

🔴 Bug fixes

Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.

👥 Contributors

@adrianeboyd, @honnibal, @ines

spaCy - v3.2.3: Fix Tok2Vec for empty batches

Published by adrianeboyd over 2 years ago

🔴 Bug fixes

Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @honnibal, @ines

spaCy - v3.1.5: Bug fixes for Tok2Vec, SpanCategorizer, and more

Published by adrianeboyd over 2 years ago

🔴 Bug fixes

Fix issue #9593: Use metaclass to subclass errors for easier pickling.
Fix issue #9654: Fix spancat for empty docs and zero suggestions.
Fix issue #9979: Fix type of Lexeme.rank.
Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz

spaCy - v3.0.8: Fix Tok2Vec for empty batches

Published by adrianeboyd over 2 years ago

🔴 Bug fixes

Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @danieldk, @honnibal, @ines

spaCy - v3.2.2: Improved NER and parser speeds, bug fixes and more

Published by adrianeboyd over 2 years ago

✨ New features and improvements

Improved parser and ner speeds on long documents (see technical details in #10019).
Support for spancat components in debug data.
Support for ENT_IOB as a Matcher token pattern key.
Extended and improved types for many classes.

🔴 Bug fixes

Fix issue #9735: Make floret murmurhash endian-neutral.
Fix issue #9738: Support string IOB values for ENT_IOB.
Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
Fix issue #9960: Warn about entities that cross sentence boundaries in debug data.
Fix issue #9979: Fix type for Lexeme.rank.
Fix issue #10026: Check for 0-size assets in spacy project.
Fix issue #10051: Consistently return scalars from similarity methods.
Fix issue #10052: Fix spaces in Doc.from_docs() for empty docs.
Fix issue #10079: Fix label detection in debug data for components with custom names.
Fix issue #10109: Add types to Underscore and DependencyMatcher and improve types in Language, Matcher and PhraseMatcher.
Fix issue #10130: Fix Tokenizer.explain when infixes appear as prefixes.
Fix issue #10143: Use simple suggester in spancat initialization.
Fix issue #10164: Support IS_SENT_END in Doc.has_annotation.
Fix issue #10192: Detect invalid package names in spacy package.
Fix issue #10223: Support mixed case in package names.
Fix issue #10234: Fix type in PhraseMatcher.
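
The runtime error mentioned for #9746 above is a general Python pitfall: a dict must not change size while it is being iterated. The sketch below reproduces the error and shows one common remedy, iterating over a snapshot. The function names are hypothetical, and the release notes do not say which remedy spaCy applied.

```python
def drop_empty_unsafe(registry: dict) -> None:
    # Deleting while iterating over the live view raises RuntimeError.
    for key, value in registry.items():
        if not value:
            del registry[key]


def drop_empty_safe(registry: dict) -> None:
    # Iterate over a snapshot of the items so deletion is safe.
    for key, value in list(registry.items()):
        if not value:
            del registry[key]


data = {"a": [], "b": [1], "c": []}
drop_empty_safe(data)
print(data)  # {'b': [1]}

try:
    drop_empty_unsafe({"a": [], "b": [1]})
except RuntimeError as err:
    print("RuntimeError:", err)
```

The same error can surface across threads when one thread iterates a shared dict while another inserts into it, which is harder to spot than the single-function case shown here.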

📖 Documentation and examples

Various documentation updates.
New spaCy version tags in spaCy universe.
New Dockerfile for repeatable website builds and easier local development.
New additions to spaCy universe:
- Augmenty: a text augmentation library
- Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
- spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
- spacypdfreader: easy PDF to text to spaCy text extraction
- textnets: text analysis with networks

👥 Contributors

@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav

spaCy - v3.2.1: doc_cleaner component, new Matcher attributes, bug fixes and more

Published by adrianeboyd almost 3 years ago

✨ New features and improvements

NEW: doc_cleaner component for removing doc.tensor, doc._.trf_data or other Doc attributes at the end of the pipeline to reduce the size of output docs.
NEW: ENT_ID and ENT_KB_ID to Matcher pattern attributes.
Support kb_id for entities in displaCy from Doc input.
Add Span.sents property for spans spanning over more than one sentence.
Add EntityRuler.remove to remove patterns by id.
Make the Tagger neg_prefix configurable.
Use Language.pipe in Language.evaluate for more efficient processing.
Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.

🔴 Bug fixes

Fix issue #9638: Make JsonlCorpus path optional again.
Fix issue #9654: Fix spancat for empty docs and zero suggestions.
Fix issue #9658: Improve error message for incorrect .jsonl paths in EntityRuler.
Fix issue #9674: Fix language-specific factory handling in package CLI.
Fix issue #9694: Convert labels to strings for README in package CLI.
Fix issue #9697: Exclude strings from source vector checks.
Fix issue #9701: Allow Scorer.score_spans to handle predicted docs with missing annotation.
Fix issue #9722: Initialize parser from reference parse rather than aligned example.
Fix issue #9764: Set annotations more efficiently in tagger and morphologizer.

📖 Documentation and examples

Various documentation updates: init_tok2vec after pretraining, batch contract for listeners.
New additions to the spaCy universe:
- eng-spacysentiment: Sentiment analysis for English.
- Applied Language Technology course: NLP for newcomers using spaCy and Stanza.

👥 Contributors

@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar

spaCy - v3.2.0: Registered scoring functions, Doc input, floret vectors and more

Published by adrianeboyd almost 3 years ago

✨ New features and improvements

NEW: Registered scoring functions for each component in the config.
NEW: nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing.
NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
New overwrite config setting for entity_linker, morphologizer, tagger, sentencizer and senter.
New extend config setting for morphologizer that controls whether existing feature types are preserved.
Support for a wider range of language codes in spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
New package spacy-loggers for additional loggers.
New Irish lemmatizer.
New Portuguese noun chunks and updated Spanish noun chunks.
Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
Japanese reading and inflection from sudachipy are annotated as Token.morph features.
Additional morph_micro_p/r/f scores for morphological features from Scorer.score_morph_per_feat().
LIKE_URL attribute includes the tokenizer URL pattern.
--n-save-epoch option for spacy pretrain.
Trained pipelines:
- New transformer pipeline for Japanese ja_core_news_trf, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
- Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
- Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
- Universal Dependencies corpora updated to v2.8.
- Trailing space added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
- English attribute ruler patterns updated to improve Token.pos and Token.morph.

For more details, see the New in v3.2 usage guide.
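
The floret entry above describes vectors that combine fastText subwords with Bloom embeddings. The toy sketch below illustrates that general idea only: character n-grams are hashed several times into one fixed-size table and the selected rows are summed, so any string receives a vector. The table size, hash function (MD5 here), random weights and the choice to sum are all assumptions of this example and do not reproduce floret itself.

```python
import hashlib
import random
from typing import List

BUCKETS = 1000   # rows in the shared table (assumed value)
DIM = 8          # vector width (assumed value)
HASH_COUNT = 2   # rows looked up per n-gram (assumed value)

# Random weights stand in for a trained table.
rng = random.Random(0)
TABLE = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def char_ngrams(word: str, min_n: int = 3, max_n: int = 4) -> List[str]:
    # fastText-style: wrap the word in boundary markers, then slide windows.
    wrapped = f"<{word}>"
    grams = [wrapped]
    for n in range(min_n, max_n + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams


def rows_for(gram: str) -> List[int]:
    # Bloom-embedding idea: several differently seeded hashes into one table.
    rows = []
    for seed in range(HASH_COUNT):
        digest = hashlib.md5(f"{seed}:{gram}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % BUCKETS)
    return rows


def vector(word: str) -> List[float]:
    # Sum every selected row; unseen words still map to some rows.
    total = [0.0] * DIM
    for gram in char_ngrams(word):
        for row in rows_for(gram):
            total = [t + x for t, x in zip(total, TABLE[row])]
    return total


print(len(vector("neverseenbefore")))  # 8
```

The table never grows with the vocabulary, which is where the "compact" part comes from, and hashing subwords is what gives every input string a vector.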

🔴 Bug fixes

Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
Fix issue #9032: Retain alignment between doc and context for Language.pipe(as_tuples=True) for multiprocessing with custom error handlers.
Fix issue #9136: Ignore prefixes when applying suffix patterns in Tokenizer.
Fix issue #9584: Use metaclass to subclass errors to allow better pickling.

⚠️ Backwards incompatibilities

In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
The tokenizer classes ChineseTokenizer, JapaneseTokenizer, KoreanTokenizer, ThaiTokenizer and VietnameseTokenizer require Vocab rather than Language in __init__.
In DocBin, user data is now always serialized according to the store_user_data option, see #9190.

📖 Documentation and examples

Demo projects for floret vectors:
- pipelines/floret_vectors_demo: basic floret vector training and importing.
- pipelines/floret_fi_core_demo: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
- pipelines/floret_ko_ud_demo: Korean UD vector and pipeline training, comparing standard vs. floret vectors.

👥 Contributors

@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker

spaCy - v3.1.4: Python 3.10 wheels and support for AppleOps

Published by svlandeg almost 3 years ago

✨ New features and improvements

NEW: Binary wheels for Python 3.10.
NEW: Improve performance on Apple M1 with AppleOps: pip install spacy[apple].
GPU profiling with spacy.models_with_nvtx_range.v1.
Full mypy integration in the CI and many type fixes across the code base.
Added custom Protocol classes in ty.py to define behavior of pipeline components.
Support for entity linking visualization in displacy.
Allow overriding vars in spacy project assets.
Standalone train function to run the training from Python scripts just like the spacy train CLI.
Support for spacy-transformers>=1.1.0 with improved IO.
Support for thinc>=8.0.11 with improved gradient clipping.

🔴 Bug fixes

Fix issue #5507: Improve UX for multiprocessing on GPU.
Fix issue #9137: Fix serialization for KnowledgeBase.set_entities.
Fix issue #9244: Fix vectors for 0-length spans.
Fix issue #9247: Improve UX for the DocBin constructor.
Fix issue #9254: Allow unicode in a spacy project title.
Fix issue #9263: Make added patterns consistent in the DependencyMatcher.
Fix issue #9305: Restore tokenization timing during evaluation.
Fix issue #9335: Sync vocab in vectors and sourced components.
Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
Fix issue #9404: Create consistent default textcat and textcat_multilabel configurations.
Fix issue #9437: Improve UX around Doc object creation.
Fix issue #9465: Fix minor issues with convert CLI.
Fix issue #9500: Include .pyi files in the distributed package.

📖 Documentation and examples

Various updates to the documentation.
New additions to the spaCy universe:
- deplacy: CUI-based dependency visualizer
- ipymarkup: Visualizations for NER and syntax trees
- PhruzzMatcher: Find fuzzy matches
- spacy-huggingface-hub: Push spaCy pipelines to the Hugging Face Hub
- spaCyOpenTapioca: Entity Linking on Wikidata
- spacy-clausie: Clause-based information extraction system
- "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
- "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly

👥 Contributors

@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker

spaCy - v3.1.3: Bug fixes and UX updates

Published by svlandeg about 3 years ago

✨ New features and improvements

Version 3 of WandbLogger now supports optional run_name and entity parameters.
Improved UX when providing invalid pos values for a Doc or Token.

🔴 Bug fixes

Fix issue #9001: Pass alignments to Matcher callbacks.
Fix issue #9009: Include component factories in third-party dependencies resolver.
Fix issue #9012: Correct type of config in create_pipe.
Fix issue #9014: Allow typer 0.4 to provide support for both Click 7 and Click 8.
Fix issue #9033: Fix verbs list for French tokenizer exceptions.
Fix issue #9059: Pass overrides to subcommands in spacy project workflows.
Fix issue #9074: Improve UX around repo and path arguments in spacy project.
Fix issue #9084: Fix inference of epoch_resume in spacy pretrain.
Fix issue #9163: Handle spacy-legacy in spacy package dependency detection.
Fix issue #9211: Include only runtime-relevant dependencies in spacy package.

📖 Documentation and examples

Various updates to the documentation.
A few additions and updates to the spaCy universe.
Extended the developer documentation with information about the listener pattern, the StringStore and the Vocab.

👥 Contributors

@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker

spaCy - v3.1.2: Improved spancat component and various bugfixes

Published by svlandeg about 3 years ago

✨ New features and improvements

NEW: Provide scores for the SpanCategorizer predictions.
NEW: Broader compatibility with type checkers thanks to .pyi stub files.
NEW: Auto-detect package dependencies in spacy package.
New INTERSECTS operator for the Matcher.
More debugging info for spacy project push and pull commands.
Allow passing in a precomputed array for speeding up multiple Span.as_doc calls.
The default da transformer is now the same as the one from the trained pipelines (Maltehb/danish-bert-botxo).
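
The INTERSECTS operator above belongs to a family of set-style comparisons, alongside the subset and superset checks mentioned in this release's bug fixes. The sketch below shows the underlying set logic in plain Python; the function names are hypothetical, and this is not the Matcher's implementation.

```python
from typing import Iterable


def is_subset(token_values: Iterable[str], pattern_values: Iterable[str]) -> bool:
    # Every token value must appear in the pattern's list.
    return set(token_values) <= set(pattern_values)


def is_superset(token_values: Iterable[str], pattern_values: Iterable[str]) -> bool:
    # The token values must cover everything in the pattern's list.
    return set(token_values) >= set(pattern_values)


def intersects(token_values: Iterable[str], pattern_values: Iterable[str]) -> bool:
    # At least one value in common is enough.
    return bool(set(token_values) & set(pattern_values))


features = ["Number=Sing", "Person=3", "Tense=Pres"]
print(intersects(features, ["Tense=Past", "Tense=Pres"]))  # True
print(is_subset(features, ["Number=Sing", "Person=3"]))    # False
print(is_superset(features, ["Number=Sing", "Person=3"]))  # True
```

An intersection test is the loosest of the three: it is useful when a pattern should fire if a token carries any one of several alternative values.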

🔴 Bug fixes

Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
Fix issue #8774: Ensure debug data runs correctly with a custom tokenizer.
Fix issue #8784: Fix incorrect ISSUBSET and ISSUPERSET in schema and docs.
Fix issue #8796: Respect the no_skip value for spacy project run.
Fix issue #8810: Make ConsoleLogger flush after each logging line.
Fix issue #8819: Pass exclude when serializing the vocab.
Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
Fix issue #8970: Fix allow_overlap default for span categorizer scoring.
Fix issue #8982: Add glossary entry for _SP.
Fix issue #9007: Fix span categorizer training on nested entities.

📖 Documentation and examples

New developer documentation covering spaCy's internals and code conventions.
Added a documentation section on preparing training data in spaCy's binary format.
Updated some error/log messages to be more informative.
Various updates to the documentation.
A few new additions to the spaCy universe.

👥 Contributors

@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker

spaCy - v3.0.7: Bug fixes and base support for Azerbaijani

Published by adrianeboyd about 3 years ago

✨ New features and improvements

Alpha tokenization support for Azerbaijani.
Updates for French stop words.

🔴 Bug fixes

Fix issue #7629: Fix scoring normalization.
Fix issue #7886: Fix unknown tokens percentage in debug data.
Fix issue #7907: Update load_lookups return type and docstring.
Fix issue #7930: Make EntityLinker robust for nO=None.
Fix issue #7925: Skip vector ngram backoff if minn is not set.
Fix issue #7973: Fix debug model for transformers.
Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
Fix issue #7992: Fix span offsets for Matcher(as_spans) on spans.
Fix issue #8004: Handle errors while multiprocessing.
Fix issue #8009: Fix Doc.from_docs() for all empty docs.
Fix issue #8012: Fix ensemble textcat with listener.
Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
Fix issue #8055: Handle partial entities in Span.as_doc.
Fix issue #8062: Make all Span attrs writable.
Fix issue #8066: Update debug data for textcat.
Fix issue #8069: Custom warning if DocBin is too large.
Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
Fix issue #8116: Fix offsets in Span.get_lca_matrix.
Fix issue #8132: Remove unsupported attrs from attrs.IDS.
Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
Fix issue #8169: Fix bug in EntityRuler where ent_ids returns None for phrases.
Fix issue #8208: Address config overrides that were missing after models are loaded.
Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
Fix issue #8216: Don't add duplicate patterns in EntityRuler.
Fix issue #8244: Use context manager when reading model file.
Fix issue #8245: Fix other open calls without context managers.
Fix issue #8265: Address mypy errors.
Fix issue #8299: Restrict pymorphy2 requirement to pymorphy2 mode in Russian and Ukrainian lemmatizers.
Fix issue #8335: Raise error if deps not provided with heads in Doc.
Fix issue #8368: Preserve whitespace in Span.lemma_.
Fix issue #8396: Make JsonlReader path optional.
Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
Fix issue #8426: Fix setting empty entities in Example.from_dict.
Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
Fix issue #8584: Raise an error for textcat with <2 labels.
Fix issue #8551: Fix duplicate spacy package CLI opts.

👥 Contributors

@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD

spaCy - v3.1.1: Support for Ancient Greek and various bug fixes

Published by svlandeg about 3 years ago

✨ New features and improvements

Alpha tokenization support for Ancient Greek.
Implementation of a noun_chunk iterator for Dutch.
Support for black & flake8 as pre-commit hooks.
New spacy.ngram_range_suggester.v1 for suggesting a range of n-gram sizes for the spancat component.
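
To give a sense of what suggesting a range of n-gram sizes means for the spancat component, here is a minimal pure-Python sketch. It is not spaCy's implementation: the function name `ngram_range_spans` is hypothetical, the `min_size` and `max_size` parameters are assumptions, and it returns plain (start, end) token offsets instead of the array structures spaCy uses internally.

```python
from typing import List, Tuple


def ngram_range_spans(
    n_tokens: int, min_size: int, max_size: int
) -> List[Tuple[int, int]]:
    """Enumerate candidate spans of min_size..max_size tokens (hypothetical helper)."""
    if min_size < 1 or max_size < min_size:
        raise ValueError("Expected 1 <= min_size <= max_size")
    spans = []
    for size in range(min_size, max_size + 1):
        # Slide a window of `size` tokens across the document.
        for start in range(0, n_tokens - size + 1):
            spans.append((start, start + size))
    return spans


# A 4-token document with unigram and bigram candidates.
candidates = ngram_range_spans(4, min_size=1, max_size=2)
print(candidates)
```

For a 4-token document this yields four unigram spans followed by three bigram spans. Sizes larger than the document produce no candidates.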

🔴 Bug fixes

Fix issue #8638: Fix Azerbaijani initialization.
Fix issue #8639: Use 0-vector for OOV lexemes.
Fix issue #8640: Update lexeme ranks for loaded vectors.
Fix issue #8651: Fix ru and uk multiprocessing (with spawn).
Fix issue #8663: Preserve existing meta information with spacy package.
Fix issue #8718: Ensure that replace_pipe takes disabled components into account.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe