A set of scripts for processing data for grammatical error correction
MIT License
edits/extract_edits.py
- extract edits from parallel sentences
edits/calculate_stats.py
- calculate edit statistics such as error rates and the number of substitutions, deletions, and insertions
edits/reduce_error_rate.py
- remove correct sentences to reach a desired sentence error rate
Examples:
python extract_edits.py -d ' => ' < file.txt > edits.txt
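To illustrate what extracting edits from a parallel sentence pair can look like, here is a minimal sketch built on Python's standard `difflib`. It is not the code of `extract_edits.py`. It assumes each input line holds a source and a corrected sentence separated by the delimiter passed with `-d`, and that sentences are already tokenized on whitespace. The function names `parse_pair`, `extract_edits`, and `sentence_error_rate` are hypothetical.

```python
import difflib


def parse_pair(line, delimiter=" => "):
    """Split one input line into (source, target); the line format is an assumption."""
    source, target = line.rstrip("\n").split(delimiter, 1)
    return source, target


def extract_edits(source, target):
    """Return word-level edits as (operation, source_words, target_words) tuples.

    The operation is one of 'replace', 'delete', or 'insert', as reported
    by difflib.SequenceMatcher; unchanged spans are skipped.
    """
    src, trg = source.split(), target.split()
    matcher = difflib.SequenceMatcher(a=src, b=trg, autojunk=False)
    return [
        (tag, src[i1:i2], trg[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def sentence_error_rate(pairs):
    """Fraction of sentence pairs that contain at least one edit."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    erroneous = sum(1 for source, target in pairs if extract_edits(source, target))
    return erroneous / len(pairs)


if __name__ == "__main__":
    line = "He go to school yesterday . => He went to school yesterday ."
    source, target = parse_pair(line)
    print(extract_edits(source, target))
    # prints [('replace', ['go'], ['went'])]
```

Counting the `replace`, `delete`, and `insert` operations over a corpus gives the kind of statistics listed above. Dropping pairs with no edits until `sentence_error_rate` reaches a target value is one way to approximate what `reduce_error_rate.py` is described as doing.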
patterns/extract_patterns.py
- extract regex patterns from parallel sentences
patterns/filter_patterns.py
- filter out edits that do not match referential patterns
Examples:
python extract_patterns.py < file.txt > patterns.txt
python filter_patterns.py -p patterns.txt -m 3 < corpus.txt > filtered.txt
nltk/nltk_tok.py
- a wrapper for the NLTK tokenizer
nltk/nltk_detok.py
- an NLTK-compatible detokenizer
Examples:
python nltk_tok.py -q -j8 < file.txt > file.tok.txt
python nltk_detok.py < file.tok.txt > file.txt
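For readers unfamiliar with NLTK, the round trip between raw and tokenized text can be sketched with NLTK's own Treebank classes. This assumes the `nltk` package is installed. Whether `nltk_tok.py` and `nltk_detok.py` use these particular classes is an assumption, and the helper names `tokenize_line` and `detokenize_line` are hypothetical.

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer

# Both classes are rule-based and need no downloaded NLTK data.
_tokenizer = TreebankWordTokenizer()
_detokenizer = TreebankWordDetokenizer()


def tokenize_line(line):
    """Return the line with tokens separated by single spaces."""
    return " ".join(_tokenizer.tokenize(line))


def detokenize_line(line):
    """Rebuild plain text from a space-separated tokenized line."""
    return _detokenizer.detokenize(line.split())


if __name__ == "__main__":
    tokenized = tokenize_line("Hello, world!")
    print(tokenized)                   # prints Hello , world !
    print(detokenize_line(tokenized))  # prints Hello, world!
```

Detokenization is heuristic, so the round trip is not guaranteed to reproduce every input exactly.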