GEC Scripts

A set of scripts for processing data for grammatical error correction.

Edits

edits/extract_edits.py - extract edits from parallel sentences
edits/calculate_stats.py - calculate edit statistics such as error rates and the number of substitutions, deletions, and insertions
edits/reduce_error_rate.py - remove correct sentences to reach a desired sentence error rate

Examples:

python extract_edits.py -d ' => ' < file.txt > edits.txt
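
To illustrate the idea behind these scripts, here is a minimal sketch of edit extraction and edit statistics built on Python's standard difflib module. It is not the code in this repository: the function names (extract_edits, edit_stats, correct_sentences_to_drop), the whitespace tokenization, and the tuple layout are assumptions made for the example, and the input and output formats of the real scripts may differ.

```python
import difflib
from collections import Counter


def extract_edits(source, target):
    """Return (type, source_span, target_span) tuples for one sentence pair.

    Hypothetical helper: tokens are split on whitespace and aligned with
    difflib, which is only one of several possible alignment strategies.
    """
    src, trg = source.split(), target.split()
    matcher = difflib.SequenceMatcher(a=src, b=trg, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # difflib tags: 'replace' (substitution), 'delete', 'insert'
        edits.append((tag, " ".join(src[i1:i2]), " ".join(trg[j1:j2])))
    return edits


def edit_stats(pairs):
    """Count edits by type and compute the sentence error rate."""
    counts = Counter()
    erroneous = 0
    for source, target in pairs:
        edits = extract_edits(source, target)
        erroneous += bool(edits)  # a sentence is erroneous if it has any edit
        counts.update(tag for tag, _, _ in edits)
    rate = erroneous / len(pairs) if pairs else 0.0
    return counts, rate


def correct_sentences_to_drop(n_erroneous, n_correct, desired_rate):
    """Number of correct sentences to remove so that
    erroneous / total is at least desired_rate."""
    if not 0 < desired_rate <= 1:
        raise ValueError("desired_rate must be in (0, 1]")
    # erroneous / (erroneous + kept) >= rate
    #   =>  kept <= erroneous * (1 - rate) / rate
    # int() rounds down, so the resulting rate never falls below the target.
    max_kept = int(n_erroneous * (1 - desired_rate) / desired_rate)
    return max(0, n_correct - max_kept)


if __name__ == "__main__":
    pairs = [
        ("He go to school .", "He goes to school ."),
        ("I went school", "I went to school"),
        ("I like cats .", "I like cats ."),
        ("We are here .", "We are here ."),
    ]
    counts, rate = edit_stats(pairs)
    print(dict(counts), rate)
    print(correct_sentences_to_drop(2, 6, 0.5))
```

The last function shows the arithmetic behind reducing the error rate: with E erroneous sentences and a target rate r, at most E * (1 - r) / r correct sentences can be kept.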

Error patterns

patterns/extract_patterns.py - extract regex patterns from parallel sentences
patterns/filter_patterns.py - filter out edits that do not match the reference patterns

Examples:

python extract_patterns.py < file.txt > patterns.txt
python filter_patterns.py -p patterns.txt -m 3 < corpus.txt > filtered.txt

Tokenization

nltk/nltk_tok.py - a wrapper for the NLTK tokenizer
nltk/nltk_detok.py - an NLTK-compatible detokenizer

Examples:

python nltk_tok.py -q -j8 < file.txt > file.tok.txt
python nltk_detok.py < file.tok.txt > file.txt

