A more accurate spelling correction for the Ukrainian language.
When a spell checker is used in systems that correct spelling automatically, without human verification, the following questions arise:
To address these issues, we propose a system that is compatible with any spell checker but favors precision over recall. We improve the accuracy of a spell checker by using these complementary models:
sudo apt-get install python-dev
pip install speliuk
By default, Speliuk will use pre-trained models stored on Hugging Face.
>>> from speliuk.correct import Speliuk
>>> speliuk = Speliuk()
>>> speliuk.load()
>>> speliuk.correct("то він моее це зраабити для меніе?")
Correction(corrected_text='то він може це зробити для мене?', annotations=[Annotation(start=7, end=11, source_text='моее', suggestions=['може'], meta={}), Annotation(start=15, end=23, source_text='зраабити', suggestions=['зробити'], meta={}), Annotation(start=28, end=33, source_text='меніе', suggestions=['мене'], meta={})])
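The `start` and `end` values in each annotation are character offsets into the input text. As a minimal sketch of how such annotations could be applied by hand (for example, to pick a different suggestion), the code below uses a hypothetical stand-in `Annotation` dataclass that only mirrors the fields printed above; it is not the class shipped with Speliuk, and `apply_annotations` is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    # Hypothetical stand-in mirroring the fields printed above;
    # not the class shipped with Speliuk.
    start: int
    end: int
    source_text: str
    suggestions: list
    meta: dict = field(default_factory=dict)


def apply_annotations(text, annotations):
    """Replace each annotated span with its first suggestion."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        if not ann.suggestions:
            continue  # nothing to substitute; keep the original span
        if text[ann.start:ann.end] != ann.source_text:
            raise ValueError("annotation offsets do not match the text")
        text = text[:ann.start] + ann.suggestions[0] + text[ann.end:]
    return text


text = "то він моее це зраабити для меніе?"
annotations = [
    Annotation(7, 11, "моее", ["може"]),
    Annotation(15, 23, "зраабити", ["зробити"]),
    Annotation(28, 33, "меніе", ["мене"]),
]
print(apply_annotations(text, annotations))
```

Applying the replacements from the end of the string backwards avoids recomputing offsets when a suggestion is shorter or longer than the misspelled word.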
Speliuk can also be used as a component in a spaCy pipeline:
>>> import spacy
>>> from speliuk.correct import CorrectionPipe
>>> nlp = spacy.blank('uk')
>>> nlp.add_pipe('speliuk', config=dict(spacy_spelling_model_path='/my/custom/model'))
>>> doc = nlp("то він моее це зраабити для меніе?")
>>> doc._.speliuk_corrected
'то він може це зробити для мене?'
>>> doc.spans["speliuk_errors"]
[моее, зраабити, меніе]
To detect spelling errors, a spaCy NER model is used.
It was trained on a combination of synthetic and golden data:
We used KenLM for fast perplexity calculation, relying on the existing Yehor/kenlm-uk model, which was trained on UberText.
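For readers unfamiliar with perplexity: KenLM scores are log10 probabilities, and perplexity is derived from the total score and the number of scored tokens. The sketch below shows that arithmetic in plain Python, together with one possible precision-oriented acceptance rule. The rule, the `margin` parameter, and the example scores are assumptions for illustration; the text above does not specify how Speliuk combines perplexity with the other models.

```python
def perplexity(total_log10_prob, n_tokens):
    """Perplexity from a total log10 probability spread over n_tokens tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return 10.0 ** (-total_log10_prob / n_tokens)


def keep_correction(original_ppl, corrected_ppl, margin=1.0):
    """Hypothetical rule: accept a correction only if it lowers perplexity.

    margin >= 1.0 demands a larger improvement, trading recall for precision.
    """
    return corrected_ppl * margin < original_ppl


# Made-up scores, for illustration only.
original = perplexity(-12.0, 4)   # sentence containing a misspelling
corrected = perplexity(-8.0, 4)   # same sentence after correction
print(original, corrected, keep_correction(original, corrected))
```

A lower perplexity means the language model finds the sentence more plausible, so a rule of this shape rejects corrections that do not make the sentence read more naturally.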
We used SymSpell for error correction. The dictionary consists of the 500k most frequent words from the UberText corpus.
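To illustrate the idea behind a SymSpell-style lookup, here is a simplified, self-contained sketch of symmetric-delete candidate generation over a frequency dictionary. It is not the SymSpell library or its API: it uses plain Levenshtein distance instead of Damerau-Levenshtein, the function names are hypothetical, and the word counts are made up.

```python
from collections import defaultdict


def deletes(word, max_distance):
    """All strings reachable from word by deleting up to max_distance characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(frequencies, max_distance=2):
    """Map every delete variant of each dictionary word back to that word."""
    index = defaultdict(set)
    for word in frequencies:
        for key in deletes(word, max_distance):
            index[key].add(word)
    return index


def lookup(term, index, frequencies, max_distance=2):
    """Candidates sorted by edit distance, then by descending frequency."""
    candidates = set()
    for key in deletes(term, max_distance):
        candidates |= index.get(key, set())
    scored = []
    for word in candidates:
        distance = edit_distance(term, word)
        if distance <= max_distance:  # delete keys can collide; verify for real
            scored.append((distance, -frequencies[word], word))
    return [word for _, _, word in sorted(scored)]


# Made-up counts, for illustration only.
frequencies = {"то": 950, "він": 700, "може": 900, "це": 850,
               "зробити": 300, "для": 600, "мене": 800}
index = build_index(frequencies)
print(lookup("моее", index, frequencies))
print(lookup("зраабити", index, frequencies))
```

Precomputing delete variants makes lookup cost depend on the length of the misspelled word rather than on the dictionary size, which is what keeps this approach fast with a large dictionary.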