Spell checker using Brill and Moore's noisy channel error model
APACHE-2.0 License
This is a Java implementation of the noisy channel spell checking approach presented in:
Brill and Moore (2000). An Improved Error Model for Noisy Channel Spelling Correction. In Proceedings of the ACL 2000.
The spell checker's error model is trained on a list of pairs of misspellings
with corrections, considering generic character edits up to a specified maximum
edit length (e.g., the edit ant
→ent
from the pair
dependant
→dependent
).
To use this spell checker you need:
The spell checker does not know anything about morphology or sentence-initial capitalization, so it expects all possible forms of a word (inflected, capitalized, lowercase, mixed case, etc.) to appear in the list of potential corrections. The command-line wrapper includes flags to expand a provided dictionary with lowercase and capitalized versions of all words.
$ mvn package
$ java -jar target/brillmoore-0.1-jar-with-dependencies.jar
usage: java -jar brillmoore-0.1-jar-with-dependencies.jar
-a,--minatoa <arg> minimum a -> a probability (default 0.8)
-c,--candidates <arg> number of candidates to output (default 10)
-d,--dict <arg> dictionary file
-h,--help this help message
-l,--lowercase expand dictionary with lowercase versions of all
words
-p,--train <arg> training file
-s,--single add training instances for all single character
edits
-t,--test <arg> testing file
-u,--capitalized expand dictionary with capitalized versions of
all words
-w,--window <arg> window for expanding alignments (Brill and
Moore's N; default 3)
Tab-separated values are used for input and output.
misspelling TAB target TAB count
word
word TAB probability
The output echoes the test input columns (misspelling, target, count) and appends the ranked candidate corrections as pairs of columns containing the candidate correction and the -log(prob) of the candidate.
misspelling TAB target TAB count TAB candidate1 TAB -log(prob1) TAB candidate2 TAB -log(prob2) ...
Sample input files based on the Aspell common misspellings test
data are provided in data/
. See
data/README.md
for details.
$ java -jar target/brillmoore-0.1-jar-with-dependencies.jar -d data/aspell-wordlist-en_USGBsGBz.70-1.txt -p data/aspell-common.train -t data/aspell-common.dev.first10 -c 3 > data/aspell-common.dev.first10.USGBsGBz.70-1.out
Sample output:
pumkin pumpkin 1 pumpkin 4.38 pumpkin's 6.67 bumkin 7.32
reorganision reorganisation 1 reorganisation 2.88 reorganisation's 5.20 reorganisations 7.09
gallaxies galaxies 1 galaxies 4.01 galaxy's 13.26 galaxy 17.45
superceeded superseded 1 superseded 7.91 supersede 14.46 succeeded 18.34
millenia millennia 1 millennia 2.11 millennial 6.23 millennial's 8.52
pseudonyn pseudonym 1 pseudonym 4.69 pseudonym's 6.98 pseudonyms 8.87
synonymns synonyms 1 synonyms 6.46 synonym's 8.29 synonym 12.49
prominant prominent 1 predominant 1.76 prominent 2.71 preeminent 10.01
manouver maneuver 1 maneuver 1.93 manoeuvre 3.76 maneuver's 4.27
obediance obedience 1 obedience 1.98 obedience's 4.33 obeisance 10.12
Evaluation for sample output:
$ data/eval.py data/aspell-common.dev.first10.USGBsGBz.70-1.out
NotFnd Found First 1-5 1-10 1-25 1-50 Any (Max: 3)
--------------------------------------------------------------------
0 10 90.0 100.0 100.0 100.0 100.0 100.0
Evaluation for the whole dev set output in
data/aspell-common.dev.USGBsGBz.70-1.out
considering the first 100
suggestions:
NotFnd Found First 1-5 1-10 1-25 1-50 Any (Max: 100)
----------------------------------------------------------------------
18 403 84.1 93.1 94.8 95.5 95.7 95.7
(Compare to: http://aspell.net/test/common-all/)
Evaluation with default paramemeters training on all Aspell common misspellings
(data/aspell-common.all
) and testing on Aspell current test data
(data/aspell-current.all
), which focuses on difficult misspellings:
NotFnd Found First 1-5 1-10 1-25 1-50 Any (Max: 100)
----------------------------------------------------------------------
43 504 56.3 78.4 83.7 88.8 91.2 92.1
(Compare to: http://aspell.net/test/cur/)
Note: some target corrections aren't found in the provided dictionary due to
capitalization (e.g., The
, muslims
) and run-on errors (incase
). The flags
-l
and -u
could be used to expand the base word list with lowercase and
capitalized versions respectively.
// create a list of pairs of misspellings and corrections
List<Misspelling> trainMisspellings = new ArrayList<>();
trainMisspellings.add(new Misspelling("Abril", "April", 1));
// create a dictionary
Map<String, Double> dict = new HashMap<>();
dict.put("April", 1.0);
dict.put("Arzt", 1.0);
dict.put("Altstadt", 1.0);
// set the parameters
int window = 3;
double minAtoA = 0.8;
try {
// train spell checker
SpellChecker spellchecker = new SpellChecker(trainMisspellings, dict, window, minAtoA);
// run spell checker
List<Candidate> candidates = spellchecker.getRankedCandidates("Abril");
// iterate over top ten candidates
for (Candidate cand : candidates.subList(0, Math.min(candidates.size(), 10))) {
System.out.println(cand.getTarget() + "\t" + cand.getProb());
}
} catch (ParseException e) {
System.err.println(e.getMessage());
}
April 1.6094379124341005
Altstadt Infinity
Arzt Infinity
Install in the local maven archive:
$ mvn install
Add the maven dependency:
<dependency>
<groupId>de.unituebingen.sfs</groupId>
<artifactId>brillmoore</artifactId>
<version>0.1</version>
</dependency>
This code includes modified versions of: