cltk_readers

Corpus reader extension for the Classical Language Toolkit

MIT License

CLTK Readers

A corpus-reader extension for CLTK

Version 0.6.8; tested on Python 3.10.8, CLTK 1.1.5, and LatinCy 3.7.6

Installation

pip install -e git+https://github.com/diyclassics/cltk_readers.git#egg=cltk_readers

Usage

>>> from cltkreaders.lat import LatinTesseraeCorpusReader
>>> tess = LatinTesseraeCorpusReader()
>>> print(tess.fileids())
['ammianus.rerum_gestarum.part.14.tess', 'ammianus.rerum_gestarum.part.15.tess', 'ammianus.rerum_gestarum.part.16.tess', 'ammianus.rerum_gestarum.part.17.tess', ...]
>>> print(next(tess.tokenized_sents('vergil.aeneid.part.1.tess', simple=True)))
['Arma', 'virumque', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris', 'Italiam', ',', 'fato', 'profugus', ',', 'Laviniaque', 'venit', 'litora', ',', 'multum', 'ille', 'et', 'terris', 'iactatus', 'et', 'alto', 'vi', 'superum', 'saevae', 'memorem', 'Iunonis', 'ob', 'iram', ';']
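
Because the example above calls `next()` on the result of `tokenized_sents`, the method appears to return an iterator of token lists that can be consumed lazily. The following is a minimal sketch of one way to work with output of that shape. It assumes nothing about the reader beyond what the example shows: `demo_tokenized_sents` and `word_counts` are hypothetical helpers written for this illustration, not part of cltk_readers, and the stand-in generator simply replays the sentence printed above so the block runs without a corpus installed.

```python
from collections import Counter
from itertools import islice


def demo_tokenized_sents():
    """Hypothetical stand-in for tess.tokenized_sents(..., simple=True).

    Yields one list of tokens per sentence, like the output shown above.
    """
    yield ['Arma', 'virumque', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab',
           'oris', 'Italiam', ',', 'fato', 'profugus', ',', 'Laviniaque',
           'venit', 'litora', ',', 'multum', 'ille', 'et', 'terris',
           'iactatus', 'et', 'alto', 'vi', 'superum', 'saevae', 'memorem',
           'Iunonis', 'ob', 'iram', ';']


def word_counts(sents, limit=None):
    """Count lowercased word tokens in the first `limit` sentences.

    Tokens with no alphanumeric character (punctuation) are skipped.
    With limit=None, every sentence in the iterator is consumed.
    """
    counts = Counter()
    for sent in islice(sents, limit):
        counts.update(
            tok.lower() for tok in sent if any(ch.isalnum() for ch in tok)
        )
    return counts


counts = word_counts(demo_tokenized_sents(), limit=1)
print(counts.most_common(1))
```

With a corpus installed, the stand-in could presumably be swapped for `tess.tokenized_sents('vergil.aeneid.part.1.tess', simple=True)`; `islice` keeps the work bounded to the first few sentences instead of reading a whole file.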

Corpora supported (so far!)

Change log

  • 0.6.8: Add parameter to chunks method to control whether punctuation is included in chunking
  • 0.6.7: Add no annotations parameter to spacy_docs for LatinTesseraeCorpusReader
  • 0.6.6: Add root parameter to LatinTesseraeCorpusReader
  • 0.6.5: Add fileid selector support for pipe (|) delimited metadata
  • 0.6.4: Bump spaCy version
  • 0.6.3: Update fileid selector for Greek corpus readers
  • 0.6.2: Add LatinCy support for LatinPerseusCorpusReader
  • 0.6.1: Miscellaneous fixes to reader, fileid selector
  • 0.6.0: Introduce metadata-based fileid selector
  • 0.5.6: Bump spaCy version
  • 0.5.5: Update CSEL reader; Update spaCy dependency to LatinCy lg model
  • 0.5.4: Update spaCy dependency to LatinCy md model
  • 0.5.3: Update spaCy dependency to md model
  • 0.5.2: Minor fixes
  • 0.5.1: Fix spaCy model installation
  • 0.5.0: Update packaging for PyPI
  • 0.4.6: Add simple parameter to Tesserae tokenized_sents; add pos_sents to Tesserae; update demo notebook
  • 0.4.5: Update spaCy dependency to la_dep_cltk_sm-0.2.0
  • 0.4.4: Add support for Camena
  • 0.4.3: Add support for Open Greek & Latin CSEL files
  • 0.4.2: Update lxml; also update spaCy dependency (now to main spaCy project, as of v. 3.4.2)
  • 0.4.1: Update spaCy dependency
  • 0.4.0: Add support for Latin Library (and similar plaintext collections)
  • 0.3.0: Add support for Perseus-style TEI/XML files; add Latin spaCy support for lemmatization and POS tagging
  • 0.2.4: Add support for Universal Dependencies files
  • 0.2.3: Add support for Perseus AGLDT Treebanks

Coded 2022-2024 by Patrick J. Burns