Improvements to Korektor: A Case Study with Native and Non-Native Czech

We present recent developments of Korektor, a statistical spell checking system. In addition to a lexicon, Korektor uses language models to find real-word errors, which are detectable only in context. The models and error probabilities, learned from error corpora, are also used to suggest the most likely corrections. Korektor was originally trained on a small error corpus and used language models extracted from an in-house corpus, WebColl. We show two recent improvements:

• We built new language models from freely available (shuffled) versions of the Czech National Corpus and show that these perform consistently better on texts produced both by native speakers and by non-native learners of Czech.
• We trained new error models on a manually annotated learner corpus and show that they perform better than the standard error model (in error detection) not only for the learners' texts, but also for our standard evaluation data of native Czech. For error correction, the standard error model outperformed the non-native models in 2 out of 3 test datasets.

We discuss reasons for this not-quite-intuitive improvement. Based on these findings and on an analysis of errors in both native and learners' Czech, we propose directions for further improvements of Korektor.
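
To make the interplay of the language model and the error model concrete, the following is a minimal sketch of how a real-word error might be caught by combining the two scores. It is not Korektor's implementation or API: the lexicon, the bigram probabilities, the constants `P_KEEP` and `P_SUBST`, and all function names are hypothetical, and the candidate generator only considers single-letter substitutions.

```python
import math

# Toy lexicon; both "byl" (was) and "bil" (beat) are valid Czech words,
# so a lexicon lookup alone cannot flag either of them.
LEXICON = {"on", "byl", "bil", "doma"}

# Hypothetical bigram probabilities P(word | previous word).
BIGRAM_P = {
    ("on", "byl"): 0.05,
    ("on", "bil"): 0.001,
    ("byl", "doma"): 0.02,
    ("bil", "doma"): 0.0005,
}
FLOOR = 1e-6  # assumed probability for unseen bigrams

P_KEEP = 0.95   # assumed probability that the typed word is the intended one
P_SUBST = 0.01  # assumed probability of one specific single-letter substitution


def one_substitution_apart(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def candidates(observed):
    """The observed word plus lexicon words one substitution away from it."""
    cands = {w for w in LEXICON if one_substitution_apart(w, observed)}
    cands.add(observed)
    return cands


def score(prev, cand, nxt, observed):
    """Log error-model probability plus log language-model probability in context."""
    error_p = P_KEEP if cand == observed else P_SUBST
    lm_p = BIGRAM_P.get((prev, cand), FLOOR) * BIGRAM_P.get((cand, nxt), FLOOR)
    return math.log(error_p) + math.log(lm_p)


def best_correction(prev, observed, nxt):
    """Return the candidate with the highest combined score."""
    return max(candidates(observed), key=lambda c: score(prev, c, nxt, observed))


print(best_correction("on", "bil", "doma"))
```

With these toy numbers, the context "on ... doma" makes "byl" more likely than the typed "bil" even after the error model penalizes the change, which is the sense in which such errors are detectable only in context.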
