lgpsi-processing

repo to set up processing scripts and data for LGPSI

CC-BY-SA-4.0 License

Ecosystems: Python

LGPSI Processing

Repo for processing LGPSI. For official content, see seumasjeltzz/LinguaeGraecaePerSeIllustrata.

Contributors

Seumas Macdonald
James Tauber

Directory Layout

So far, the only data directories being manually modified are:

orig (the original files from Seumas)
manual-data (data needed for processing, e.g. lemma overrides from Seumas)

The main output directories are:

text (the processed text in GLTP format)
analysis (further analysis of the text in GLTP format)

Other directories are:

cache (for storing the Morpheus cache)
config (for storing configuration, e.g. for text validation)
scripts (where all the code lives)

Scripts

The scripts are run in this order (after dependencies in the Pipfile are installed):

./scripts/orig-to-para.py converts from orig files to para files under text
./scripts/para-to-sent.py converts those para files to sentence-based sent files
./scripts/add-norm.py produces the norm files in analysis from the sent files
./scripts/lemmatise.py produces the lemma files in analysis from the norm files using Morpheus and manual-data/lemma_overrides.yaml
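The run order above can be sketched as a small driver. This is only an illustration, not part of the repo: the names `PIPELINE`, `build_commands` and `run_pipeline` are hypothetical, and it assumes the scripts take no arguments, are run from the repo root, and that the Pipfile dependencies are already installed in the active Python environment.

```python
import subprocess
import sys

# Script order as listed in this README; paths are relative to the repo root.
PIPELINE = [
    "scripts/orig-to-para.py",
    "scripts/para-to-sent.py",
    "scripts/add-norm.py",
    "scripts/lemmatise.py",
]


def build_commands():
    """Return one command (argument list) per pipeline step, in order."""
    # sys.executable is the interpreter running this driver, so the scripts
    # see the same installed packages (assumption: that is the right environment).
    return [[sys.executable, script] for script in PIPELINE]


def run_pipeline(repo_root=".", runner=subprocess.run):
    """Run each step in turn from repo_root.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit status, so later steps never run on stale or missing input.
    """
    for command in build_commands():
        runner(command, cwd=repo_root, check=True)
```

Calling `run_pipeline()` from the repo root would run all four steps. Stopping at the first failure matters here because each script reads the files the previous one wrote.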

The following are modules not called from the command-line:

morpheus.py (Morphology API client)
utils.py (common functions shared between scripts)

Other scripts include:

sort-yaml.py <filename> sorts the given YAML file with top-level keys in alphabetical order
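The idea behind sorting top-level keys can be sketched as follows. This is not the code in `sort-yaml.py`: the function names are hypothetical, and the file round-trip assumes the third-party PyYAML package is available (whether it is listed in the Pipfile is not stated here).

```python
def sort_top_level(mapping):
    """Return a new dict with top-level keys in sorted order.

    Nested values are left untouched. Python compares strings by Unicode
    code point, which for polytonic Greek is only an approximation of
    alphabetical order.
    """
    return {key: mapping[key] for key in sorted(mapping)}


def sort_yaml_file(filename):
    """Rewrite a YAML file in place with its top-level keys sorted."""
    import yaml  # PyYAML (assumed dependency), imported lazily

    with open(filename, encoding="utf-8") as f:
        data = yaml.safe_load(f)
    with open(filename, "w", encoding="utf-8") as f:
        # sort_keys=False keeps the nested key order as loaded;
        # allow_unicode=True writes Greek text directly instead of escapes.
        yaml.safe_dump(sort_top_level(data), f, allow_unicode=True, sort_keys=False)
```

A load-and-dump round trip like this drops any comments in the file and may change quoting or layout, so it suits data files more than hand-annotated ones.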

License

The content is CC-BY-SA and the code is MIT.
