Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Doc and its sentences and tokens. Can also be used as a command-line tool.
BSD-2-Clause License
Published by BramVanroy over 1 year ago
Full Changelog: https://github.com/BramVanroy/spacy_conll/compare/v3.3.0...v3.4.0
Published by BramVanroy over 1 year ago
Since spaCy 3.2.0, the data that is passed to a spaCy pipeline has become more strict. This means that passing a list of pretokenized tokens (`["This", "is", "a", "pretokenized", "sentence"]`) is not accepted anymore. Therefore, the `is_tokenized` option needed to be adapted to reflect this. It is still possible to pass a string where tokens are separated by whitespace, e.g. `"This is a pretokenized sentence"`, which will continue to work for spaCy and stanza. Support for pretokenized data has been dropped for UDPipe.
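As a rough illustration of the new behaviour, here is a minimal plain-Python sketch (a hypothetical stand-in, not the actual `SpacyPretokenizedTokenizer` implementation): a whitespace-separated string is accepted, while a list of tokens is rejected.

```python
class WhitespacePretokenizer:
    """Simplified stand-in for a whitespace pretokenizer: each
    whitespace-separated chunk of the input string is one token."""

    def __call__(self, text):
        if not isinstance(text, str):
            # Lists of tokens are no longer supported
            raise TypeError("Pass tokens as a whitespace-separated string, not a list")
        return text.split()


tokenizer = WhitespacePretokenizer()
print(tokenizer("This is a pretokenized sentence"))
```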
Specific changes:

- `is_tokenized` is not a valid argument to `ConllParser` any more.
- `SpacyPretokenizedTokenizer.__call__` does not support a list of tokens any more.

Published by BramVanroy over 2 years ago
- Fixed: `SpaceAfter=No` was not added correctly to tokens.
- Added `ConllFormatter` as an entry point, which means that you do not have to import `spacy_conll` anymore when you want to add the pipe to a parser! spaCy will know where to look for the CoNLL formatter: you can call `nlp.add_pipe("conll_formatter")` without having to import the component manually.
- Moved `merge_dicts_strict` to utils, outside the formatter class.
- `ConllParser` can now be imported directly: `from spacy_conll import ConllParser`.
- Added an `exclude_spacy_components` argument.
- Added a `disable_sbd` option: `nlp.add_pipe("disable_sbd", before="parser")`.
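The `merge_dicts_strict` utility itself is not shown here, but a strict dict merge along these lines can be sketched in plain Python (a hypothetical reimplementation, not the actual code):

```python
def merge_dicts_strict(d1, d2):
    """Merge two dicts, refusing to silently overwrite values
    when both dicts define the same key."""
    overlap = d1.keys() & d2.keys()
    if overlap:
        raise KeyError(f"Refusing to merge dicts with shared keys: {sorted(overlap)}")
    merged = dict(d1)
    merged.update(d2)
    return merged


print(merge_dicts_strict({"ID": 1, "FORM": "Hello"}, {"LEMMA": "hello"}))
```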
Published by BramVanroy over 3 years ago
Published by BramVanroy over 3 years ago
Published by BramVanroy over 3 years ago
This release makes `spacy_conll` compatible with spaCy's new v3 release. On top of that, some improvements were made to make the project easier to maintain.

- `is_tokenized` now disables sentence segmentation.
- Short model names such as `en` are not accepted any more. You have to provide the full model name, e.g. `en_core_web_sm`.
- See `parse-as-conll -h` for more information.
- Added `n_process` to set the number of processes. Will try to figure out whether multiprocessing is available.
- Added `ignore_pipe_errors`, both on the command line and in `ConllParser`'s parse methods.

Published by BramVanroy over 3 years ago
Dropped support for `spacy-stanfordnlp`. `spacy-stanza` is still supported.

Published by BramVanroy over 4 years ago
Fully reworked version!
- Support for `spacy-stanza` and `spacy-udpipe`! (Not included as a dependency; install manually.)
- Added `init_parser`, which can easily initialise a parser together with the custom formatter component.
- Added a `disable_pandas` flag to the formatter class in case you would want to disable setting the pandas attribute.

Three custom attributes are available:

- `._.conll`: raw CoNLL format
  - `Token`: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values.
  - `Span`: a list of its tokens' `._.conll` dictionaries (list of dictionaries).
  - `Doc`: a list of its sentences' `._.conll` lists (list of list of dictionaries).
- `._.conll_str`: string representation of the CoNLL format
  - `Token`: tab-separated representation of the contents of the CoNLL fields, ending with a newline.
  - `Span`: the expected CoNLL format where each row represents a token. When `ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the CoNLL format.
  - `Doc`: all its sentences' `._.conll_str` combined and separated by new lines.
- `._.conll_pd`: `pandas` representation of the CoNLL format
  - `Token`: a `Series` representation of this token's CoNLL properties.
  - `Span`: a `DataFrame` representation of this sentence, with the CoNLL names as column headers.
  - `Doc`: a concatenation of its sentences' `DataFrame`s, leading to a new `DataFrame`.

Other changes:

- Removed `field_names`, assuming that you do not need to change the column names of the CoNLL properties.
- Removed the `Spacy2ConllParser` class.

Published by BramVanroy over 4 years ago
`Spacy2ConllParser` class!

Published by BramVanroy over 4 years ago
The documentation has been greatly expanded. The most important addition to the README is the mention and explanation of using `spacy-stanfordnlp`. `spacy_conll` can be used together with this spaCy wrapper around `stanfordnlp`. The benefit is that we can use Stanford models with a spaCy interface. From a user perspective, this means better models, guaranteed Universal Dependencies tagsets, and an easy API through spaCy. (The cost is that Stanford NLP models are significantly slower than spaCy's models.) Small tests for `spacy_stanfordnlp` have been added.
A new feature is that you can now add a custom tagset map (`conversion_maps`). The idea is that you, as a user, have more control over the output tags. You can for instance specify that all `nsubj` deprel tags should be renamed to `subj`. This is useful if your model uses a different tagset than you want. See the advanced example in the README for more information.
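A minimal plain-Python sketch of the idea behind a conversion map (hypothetical helper; the real `conversion_maps` handling lives inside the formatter):

```python
# One map per CoNLL field: here, rename every "nsubj" deprel to "subj".
conversion_maps = {"deprel": {"nsubj": "subj"}}


def convert(field, value, maps=conversion_maps):
    """Return the mapped value for this field, falling back to the
    original value when the field or value has no replacement defined."""
    return maps.get(field, {}).get(value, value)


print(convert("deprel", "nsubj"))   # mapped
print(convert("deprel", "obj"))     # no mapping for this value: unchanged
print(convert("upostag", "NOUN"))   # no map for this field: unchanged
```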
This release closes:
Published by BramVanroy over 4 years ago
This small release adds the dependencies to `setup.py`, solving potential issues (e.g. https://github.com/BramVanroy/spacy_conll/issues/3).
Current dependencies are:
Published by BramVanroy almost 5 years ago
This small repo has been overhauled so that users can integrate it directly in their spaCy scripts. You can now use it as a spaCy component. Three custom attributes have been added to `Doc._.` and a `Doc`'s sentences. You can find more information in the README, as well as example usage.
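To give an idea of what the CoNLL-flavoured string attribute contains, here is a simplified plain-Python sketch (hypothetical helper names; the real component sets extension attributes on spaCy objects):

```python
# The ten columns of the CoNLL-U format
CONLL_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def token_to_conll_str(token):
    """Token level: one tab-separated line, one column per CoNLL field,
    with "_" for missing values, ending with a newline."""
    return "\t".join(str(token.get(field, "_")) for field in CONLL_FIELDS) + "\n"


def sent_to_conll_str(tokens):
    """Sentence (Span) level: one line per token."""
    return "".join(token_to_conll_str(t) for t in tokens)


sentence = [
    {"ID": 1, "FORM": "Hello", "LEMMA": "hello", "UPOS": "INTJ"},
    {"ID": 2, "FORM": "world", "LEMMA": "world", "UPOS": "NOUN"},
]
print(sent_to_conll_str(sentence))
```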
The command line script has been improved as well, now using the pipeline component instead of `Spacy2ConllParser`. The latter has been deprecated (but is still accessible for now). Multiprocessing via the command line script is now possible, too.