State-of-the-Art in Weighted Finite-State Spell-Checking

The following claims can be made about finite-state methods for spell-checking: (1) finite-state language models provide support for morphologically complex languages that word lists, affix stripping, and similar approaches do not provide; (2) weighted finite-state models have expressive power equal to other state-of-the-art string algorithms used by contemporary spell-checkers; and (3) finite-state models are at least as fast as other string algorithms for lookup and error correction. In this article, we use some contemporary non-finite-state spell-checking methods as a baseline and perform tests in light of these claims to evaluate state-of-the-art finite-state spell-checking methods. We verify that finite-state spell-checking systems outperform the traditional approaches for English. We also show that the models for morphologically complex languages can be made to perform on par with English systems.
