Finite-State Spell-Checking with Weighted Language and Error Models: Building and Evaluating Spell-Checkers with Wikipedia as Corpus

In this paper we present simple methods for constructing and evaluating finite-state spell-checking tools using an existing finite-state lexical automaton, freely available finite-state tools, and Internet corpora acquired from projects such as Wikipedia. As an example, we use a freely available open-source implementation of Finnish morphology, built with traditional finite-state morphology tools, and we demonstrate the rapid building of Northern Sámi and English spell-checkers from tools and resources available on the Internet.
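The idea of combining a weighted language model with an error model can be illustrated with a minimal sketch. It is not the finite-state implementation described in the paper: it assumes a plain Python dictionary in place of a weighted lexical automaton, unigram weights computed as negative log probabilities from a token list standing in for a Wikipedia corpus, and a uniform-cost error model limited to one edit (deletion, insertion, substitution, or transposition). The names `train_unigram_weights`, `edits1`, `suggest`, and the `error_weight` parameter are hypothetical.

```python
import math
from collections import Counter

# Assumption: a lowercase ASCII alphabet; a real system would derive it from the lexicon.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def train_unigram_weights(tokens):
    """Map each word form to a weight, -log(relative frequency); lower is more likely."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: -math.log(count / total) for word, count in counts.items()}


def edits1(word):
    """Return all strings exactly one edit operation away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + substitutions + inserts) - {word}


def suggest(word, weights, error_weight=1.0, limit=5):
    """Rank in-lexicon candidates by language-model weight plus error-model weight."""
    if word in weights:
        return [word]  # known word: accepted as is
    candidates = [
        (weights[c] + error_weight, c) for c in edits1(word) if c in weights
    ]
    # Ties on total weight are broken alphabetically by the tuple sort.
    return [c for _, c in sorted(candidates)[:limit]]


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat".split()
    weights = train_unigram_weights(corpus)
    print(suggest("teh", weights))
    print(suggest("xat", weights))
```

In this sketch every single edit carries the same cost, so the ranking among candidates is decided by corpus frequency alone. Per-operation costs or a larger edit distance would change the error model without touching the language model.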
