A set of files used to train Tesseract to read Georgian Mkhedruli script.
Steps I took to generate the training files
kat.training_text is based on the text available in the langdata repository, with the following manual modifications:
kat.wordlist.clean and kat.word.bigrams.clean were generated from a Wikipedia database dump, roughly as follows:
Download the latest Georgian database dump from Wikipedia: https://dumps.wikimedia.org/backup-index.html
Run WikiExtractor.py to extract the Georgian text
Concatenate the output into a single file with find <extraction_folder> -type f | xargs cat > kawikitext.txt
Remove the remaining tags with sed -i '/^<doc/ d' kawikitext.txt and sed -i '/^<\/doc/ d' kawikitext.txt
Run python count_stuff/wordcounts.py --count-what [words|bigrams] --clean --no-counts kawikitext.txt > [kat.wordlist.clean|kat.word.bigrams.clean.full] to extract words and/or bigrams from the Wikipedia text
Run head -n 40000 kat.word.bigrams.clean.full > kat.word.bigrams.clean in order to limit the number of bigrams, which would otherwise be very large (~2 million)
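The tag-stripping and counting steps above can be sketched in Python. This is not the wordcounts.py script itself, only an illustration of the idea under stated assumptions: a "word" is taken to be a run of characters in the range U+10D0 to U+10FF, bigrams are not formed across line breaks, and output is ordered from most to least frequent. The function names are hypothetical, and the real script's --clean and --no-counts options may behave differently.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a "word" is a run of characters in U+10D0..U+10FF (Mkhedruli).
# The real script's --clean option may apply a different rule.
WORD_RE = re.compile(r"[\u10D0-\u10FF]+")


def strip_doc_tags(lines):
    """Yield lines, skipping those that start with <doc or </doc."""
    for line in lines:
        if not line.startswith(("<doc", "</doc")):
            yield line


def read_extracted_text(extraction_folder):
    """Yield every line of every file under extraction_folder, tags removed."""
    for path in sorted(Path(extraction_folder).rglob("*")):
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                yield from strip_doc_tags(handle)


def count_words_and_bigrams(lines):
    """Return (word_counts, bigram_counts); bigrams never span a line break."""
    words, bigrams = Counter(), Counter()
    for line in lines:
        tokens = WORD_RE.findall(line)
        words.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams


def most_frequent(counter, limit=None):
    """Return keys ordered from most to least frequent, without the counts."""
    return [key for key, _ in counter.most_common(limit)]


# Small in-memory demonstration; a real run would use read_extracted_text().
sample = [
    '<doc id="1" title="example">\n',
    "ქართული ენა ქართული ენა არის\n",
    "</doc>\n",
]
words, bigrams = count_words_and_bigrams(strip_doc_tags(sample))
print("\n".join(most_frequent(words)))
print("\n".join(" ".join(pair) for pair in most_frequent(bigrams, limit=2)))
```

Passing a limit to most_frequent plays the same role as the head -n 40000 step: it keeps only the most frequent entries, assuming the full list is sorted by frequency.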
I selected freely licensed fonts spanning monospace, serif, and sans-serif styles. In addition, several Georgian letters can be written with different glyphs, so I made sure to include fonts that cover both glyphs (see here for details). A good selection of freely licensed Unicode Georgian fonts is available from BPG InfoTech. Other fonts are available in various places, but note that many commonly used Georgian fonts, such as AcadNusx and LitNusx, map Georgian glyphs onto Latin letters, making them unsuitable for automatically generating training images.
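One way to tell a Unicode Georgian font from a Latin-mapped one is to check whether its character map covers the Mkhedruli code points. The sketch below assumes the set of code points a font covers has already been obtained by some other means (for example with a font library such as fontTools, which is not part of the original workflow). The function names and the two sample character maps are hypothetical.

```python
# The 33 letters of the modern Georgian (Mkhedruli) alphabet occupy
# U+10D0 (ა) through U+10F0 (ჰ) in Unicode.
MKHEDRULI_LETTERS = frozenset(range(0x10D0, 0x10F1))


def missing_mkhedruli(covered_codepoints):
    """Return the Mkhedruli letters that a font's character map lacks."""
    missing = MKHEDRULI_LETTERS - set(covered_codepoints)
    return sorted(chr(cp) for cp in missing)


def is_usable_for_training(covered_codepoints):
    """True only if every modern Mkhedruli letter has a Unicode mapping."""
    return not missing_mkhedruli(covered_codepoints)


# Hypothetical character maps, for illustration only.
unicode_font = set(range(0x20, 0x7F)) | set(range(0x10D0, 0x10FB))
# A Latin-mapped font stores Georgian shapes at Latin code points only.
latin_mapped_font = set(range(0x20, 0x7F))

print(is_usable_for_training(unicode_font))
print(is_usable_for_training(latin_mapped_font))
print(len(missing_mkhedruli(latin_mapped_font)))
```

A font that fails this check would render training text written in Unicode Georgian as missing glyphs, which is why such fonts cannot be used to generate training images automatically.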
Tesseract was trained using tesstrain.sh without any modifications (except manual application of this patch).
The specific command executed to train Tesseract was:
./tesstrain.sh \
--bin_dir /usr/local/bin/ \
--fonts_dir /usr/share/fonts/ \
--lang kat \
--langdata_dir /home/pi/tesseract/kat_train/staging/ \
--output_dir /home/pi/tesseract/kat_train/output/ \
--training_text /home/pi/tesseract/kat_train/staging/kat.training_text \
--wordlist /home/pi/tesseract/kat_train/staging/kat.wordlist.clean \
--tessdata_dir /usr/local/share/tessdata \
--fontlist "BPG Chkoni+BPG Chveulebrivi GPL&GNU+BPG Classic Medium,+BPG Courier GPL&GNU+BPG DedaEna+BPG Elite GPL&GNU+BPG Glaho GPL&GNU+BPG Glaho Traditional Arial+BPG Lia+BPG Rioni+Sylfaen"
(Yes, this was done on a Raspberry Pi.)
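To rerun the training from a different directory, the same command can be assembled programmatically so that the staging and output paths are written only once. The following is a minimal sketch that uses only the flags and values shown above; the function name and the train_root parameter are hypothetical.

```python
import posixpath
import shlex

# Font names copied from the --fontlist value above. The trailing comma in
# "BPG Classic Medium," is kept exactly as it appears in the original command.
FONTS = [
    "BPG Chkoni",
    "BPG Chveulebrivi GPL&GNU",
    "BPG Classic Medium,",
    "BPG Courier GPL&GNU",
    "BPG DedaEna",
    "BPG Elite GPL&GNU",
    "BPG Glaho GPL&GNU",
    "BPG Glaho Traditional Arial",
    "BPG Lia",
    "BPG Rioni",
    "Sylfaen",
]


def build_tesstrain_command(train_root, fonts):
    """Return the tesstrain.sh argument list for a given training root."""
    staging = posixpath.join(train_root, "staging") + "/"
    return [
        "./tesstrain.sh",
        "--bin_dir", "/usr/local/bin/",
        "--fonts_dir", "/usr/share/fonts/",
        "--lang", "kat",
        "--langdata_dir", staging,
        "--output_dir", posixpath.join(train_root, "output") + "/",
        "--training_text", staging + "kat.training_text",
        "--wordlist", staging + "kat.wordlist.clean",
        "--tessdata_dir", "/usr/local/share/tessdata",
        "--fontlist", "+".join(fonts),
    ]


command = build_tesstrain_command("/home/pi/tesseract/kat_train", FONTS)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

Keeping the fonts in a list makes it easier to add or remove one without editing a long quoted string by hand.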
The count_stuff.py script can theoretically also generate files containing punctuation and numeral patterns, which tesstrain.sh can use to create DAWG files for punctuation and numbers. However, I decided to forgo these files in order to simplify the first pass at training. The results ended up being good enough that I haven't seen the need to add the punctuation and number pattern files so far, so this feature of count_stuff.py may not work perfectly, or at all.
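For readers unfamiliar with the idea, a numeral pattern is the shape of a number-bearing token once its digits are abstracted away. The sketch below illustrates that idea only. It is not the count_stuff.py implementation, and the "#" placeholder is an assumption rather than the format Tesseract expects, so check the Tesseract training documentation before using such output.

```python
import re
from collections import Counter

# Whitespace-delimited tokens that contain at least one digit.
TOKEN_RE = re.compile(r"\S*\d\S*")
DIGIT_RE = re.compile(r"\d")


def number_patterns(lines, placeholder="#"):
    """Count token shapes after replacing every digit with placeholder."""
    patterns = Counter()
    for line in lines:
        for token in TOKEN_RE.findall(line):
            patterns[DIGIT_RE.sub(placeholder, token)] += 1
    return patterns


sample = ["1921 წელს 25 თებერვალს 12:30", "2015 წელს 09:15"]
for pattern, count in number_patterns(sample).most_common():
    print(pattern, count)
```

Counting patterns instead of literal numbers keeps the list short, because every four-digit year collapses into a single entry.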
Copyright 2015, Derek Dohler. I do not claim any copyright over kat.wordlist.clean or kat.word.bigrams.clean. I claim copyright over only the alterations which I made to kat.training_text, and not over the remainder of the file. Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0