Open Source Ecosystems

PyDKPro

PyDKPro provides a Python wrapper for the DKPro Core <https://dkpro.github.io/dkpro-core/>_ NLP framework. DKPro Core itself is based on the UIMA <https://uima.apache.org>_ framework and programmed in Java. Interoperability is achieved via web services deployed as docker container <https://www.docker.com/>_. After containerization, services remain active unless manually disabled or idle for limited period of interval.

Analysis results in DKPro Core are represented as CAS objects. Conversion between Java and Python data structures is based on dkpro-cassis <https://github.com/dkpro/dkpro-cassis>_ We also provide built-in support for spacy <https://spacy.io>_ format and NLTK <https://www.nltk.org>_) format allowing for seamless integration.

PyDKPro is still under heavy development. Feedback is highly appreciated.

Demo Version

For demo purpose, different use cases are provided with working example (mocked) in Examples/UseCases.ipynb

System requirements

Python >=3.6
Git
mvn
Docker (please make sure it is running)

Installation

Creating virtual environment

Install virtualenv if not already installed.

$ python -m pip install virtualenv

Create virtual environment. Replace [env_name] with a name of your choice.

$ mkdir [env_name]

$ virtualenv -p python3 [env_name]

$ python3 -m venv [env_name]

Activate created virtual environment.

For Windows:

$ [env_name]\\Scripts\\activate.bat

For Linux and Mac OS:

$ source [env_name]/bin/activate

Creating virtual environment using conda

Create virtual environment with conda. Replace [env_name] with a name of your choice.

$ conda create --name [env_name] python=3.6

To activate the environment (on Windows, MacOS and Unix),

$ conda activate [env_name]

Clone this repository

$ git clone https://github.com/zesch/pydkpro.git

Install dependencies using pip

$ cd pydkpro

$ python -m pip install -r requirements.txt

$ python -m spacy download en_core_web_sm

Features

Using DKPro Core components directly in Python
Conversion to spaCy
Conversion to NLTK

Usage

How to open examples notebook

$ cd Examples

$ jupyter notebook

Defining an NLP pipeline

A pipeline is build by adding DKPro Core components.

.. code-block:: python

from pydkpro import Pipeline, Component
p = Pipeline(version="2.0.0", language='en')
p.add(Component().clearNlpSegmenter())
p.add(Component().stanfordPosTagger(variant='fast-caseless', printTagSet='false'))
p.build() # fire up the container web service

Note: All parameters are optional and default to best performed model versions.

Run the pipeline

For the triggered pipeline above, a CAS object will be generated for the provided string. This CAS object can be used to retrieve annotations like tokens and POS tags, etc.

.. code-block:: python

cas = p.process('Backgammon is one of the oldest known board games.', language='en')

Note: Language detector is used, if language parameter is not provided.

To return all the tokens:

.. code-block:: python

from pydkpro import DKProCoreTypeSystem as dts
cas.select(dts().token).as_text()

Output:

.. code-block:: output

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']

To return all the pos tags:

.. code-block:: python

cas.select(dts().token).get_pos()

Output:

.. code-block:: output

['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']

Provide UIMA CAS functionality

DKProCoreTypeSystem would allow integration of other type systems to nicely use DKPro Cassis <https://github.com/dkpro/dkpro-cassis>_ with their types systems. Generated cas object provide UIMA CAS functionality. For example:

.. code-block:: python

# add annotation
from pydkpro.cas import Cas
Token = dts().typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token') # define dkpro token
cas = Cas(dts().typesystem)()
cas.sofa_string = "I like cheese ."
tokens = [
    Token(begin=0, end=1, id='0', pos='NNP'),
    Token(begin=2, end=6, id='1', pos='VBD'),
    Token(begin=7, end=13, id='2', pos='IN'),
    Token(begin=14, end=15, id='3', pos='.')
]


for token in tokens:
    cas.add_annotation(token)

Cas token attributes can printed as following:

.. code-block:: python

print([x.get_covered_text() for x in cas.select_all()])
print([x.pos for x in cas.select_all()])

Output:

.. code-block:: output

['I', 'like', 'cheese', '.']
['NNP', 'VBD', 'IN', '.']

Conversion from CAS to spaCy format and vice-versa

Generated CAS objects can also be typecast to the spaCy type system.

.. code-block:: python

from pydkpro import To_spacy, From_spacy
cas = p.process('Backgammon is one of the oldest known board games.', language='en')


for token in To_spacy(cas)():
    print(token.text, token.tag_)

Conversion from spaCy

.. code-block:: python

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
cas = From_spacy(doc)()
print(cas.select(dts().token).get_pos())

Conversion from CAS to NLTK format

NLTK returns a specific format for each type of preprocessing. Here is an example for POS:

.. code-block:: python

from pydkpro.external import To_nltk, From_nltk
print(To_nltk().tagger(cas))

Output:

.. code-block:: output

[('Backgammon', 'NNP'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('oldest', 'JJS'), ('known', 'VBN'), ('board', 'NN'), ('games', 'NNS'), ('.', '.')]

This output can then be used for further integration with other NLTK components:

.. code-block:: python

import nltk
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(To_nltk().tagger(cas))
print(chunked)

Output:

.. code-block:: output

(S (Chunk Backgammon/NNP) is/VBZ one/CD of/IN the/DT oldest/JJS known/VBN board/NN games/NNS ./.)

Conversion from NLTK

PyDKPro also provides reverse functionality where a CAS object can be created from spaCy or NLTK output. In the following example, tokenization is performed using NLTK tweet tokenizer, but POS tagging is done with the DKPro wrapper of Stanford CoreNLP POS tagger using their fast.41 model:

.. code-block:: python

from nltk.tokenize import TweetTokenizer
cas = From_nltk().tokenizer(TweetTokenizer().tokenize('Backgammon is one of the oldest known board games.'))

Cas processing

PyDKPro pipeline also provide direct cas object processing as demonstrated in below example:

.. code-block:: python p = Pipeline() p.add(Component().stanfordPosTagger()) p.build()

cas = p.process(cas)

# get tokens
print(cas.select(dts().token()).as_text())

# get pos tags
print(cas.select(dts().token()).get_pos())

Shortcut for running single components

A single component can also be run without the need to build a pipeline first:

.. code-block:: python

tokenizer = Component().clearNlpSegmenter()

cas = tokenizer.process('I like playing cricket.')
print(cas.select(dts().token).as_text())

Output:

.. code-block:: output

['I', 'like', 'playing', 'cricket', '.']

Working with list of strings

Multiple strings in the form of list can also be processed, where each element of list will be considered as document.

.. code-block:: python

str_list = ['Backgammon is one of the oldest known board games.', 'I like playing cricket.']
for str in str_list:
    cas = p.process(str)
    print(cas.select(dts().token).as_text())

Working with text documents

Pipelines can also be directly run on text documents:

.. code-block:: python

from pydkpro.external import File2str

cas = p.process(File2str('test_data/input/test2.txt')())
print(cas.select(dts().token).as_text())

Working with multiple text documents

Multiple documents can also be processed by providing documents path and document name matching patterns

.. code-block:: python

# documents available at different path can be provided in list
docs = ['test_data/input/1.txt', 'test_data/input/2.txt']
for doc in docs:
    p.process(File2str(doc)())

End collection process

With following command pipeline's collection process will be completed (Alternatively, scope operator with can be used)

.. code-block:: python

p.finish()

Related Projects

pnlp

NLP预/后处理工具。

18 Apr 2019 29

preprocess_nlp

A fast framework for pre-processing (Cleaning text, Reduction of vocabulary, Feature extraction ...

04 Feb 2020 8

eluciDoc

Screens legal text and extracts sentences containing user input party name-predicate phrases

24 Jun 2023 1

python-ucto

This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first step in almost a...

21 May 2014 29

pyVHDLParser

Streaming based VHDL parser.

21 Sep 2016 78

JapaneseTokenizers

aim to use JapaneseTokenizer as easy as possible

01 Sep 2015 138

neural-network-pos-tagger

Train and evaluate neural network language models for POS tagging, tag input sentences according ...

28 May 2018 12

spacy_conll

Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpi...

19 Dec 2018 72

py3line

UNIX command-line tool for python line-based stream processing

22 Jul 2016 4

tproc

A small yet powerful text processor in Python

01 Aug 2018 8

tabgenie

A multi-purpose toolkit for table-to-text generation: web interface, Python bindings, CLI commands.

12 Oct 2022 48

datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizab...

14 Jun 2023 1,761

minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

16 Feb 2024 9,074

big-parsers-generators-comparison

A code snippet repository that provides examples of how to use different syntax parser generator ...

20 Apr 2023 7

ocrd_detectron2

OCR-D wrapper for detectron2 based segmentation models

21 Jan 2022 16