Extraction of Spelling Variations from Language Structure for Noisy Text Correction

We describe a novel approach to extracting spelling variations from a list of instances. It relates distinctive infixes of the instances to distinctive infixes of referenced words. The distinctive infixes are extracted automatically from a (multi)set of instances and a referenced dictionary, without any additional expert knowledge. Based on the spelling variations retrieved during a learning (training) phase, we develop a correction algorithm that suggests and ranks candidates for a given noisy word. The main advantage of our approach is that it provides good corrections for unobserved noisy words while being almost perfect on words observed during learning. Our experimental results on the normalisation of a typical reference corpus of Early Modern English letters [1] significantly improve on previous results obtained with VARD2 [2]. We also achieve better results than those reported in [3] and [4] on the OCR correction of the TREC-5 Confusion Track corpus [5].
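The learn-then-correct pipeline outlined in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: in place of their distinctive-infix extraction, it assumes a plain character alignment (Python's `difflib.SequenceMatcher`) to find differing substrings, and it ranks candidates simply by how often a rewrite was seen in training. The function names `extract_variations` and `suggest`, the toy training pairs, and the tiny dictionary are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_variations(pairs):
    """Count (noisy_infix, reference_infix) rewrites seen in training pairs.

    Assumption: differing substrings from a character alignment stand in
    for the paper's distinctive infixes.
    """
    counts = Counter()
    for noisy, reference in pairs:
        matcher = SequenceMatcher(None, noisy, reference, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                counts[(noisy[i1:i2], reference[j1:j2])] += 1
    return counts


def suggest(word, variations, dictionary):
    """Apply each learned rewrite once at each matching position.

    Only candidates found in the dictionary are kept; they are ranked by
    the training count of the rewrite that produced them (a simplification).
    """
    scores = Counter()
    for (src, dst), count in variations.items():
        if not src:
            continue  # pure insertions are skipped to keep the sketch short
        start = word.find(src)
        while start != -1:
            candidate = word[:start] + dst + word[start + len(src):]
            if candidate in dictionary:
                scores[candidate] = max(scores[candidate], count)
            start = word.find(src, start + 1)
    return [candidate for candidate, _ in scores.most_common()]


# Toy (noisy, reference) training pairs in the style of Early Modern English.
training = [("loue", "love"), ("haue", "have"), ("vp", "up")]
variations = extract_variations(training)

# "giue" was never observed in training, yet the learned u -> v rewrite applies.
print(suggest("giue", variations, {"give", "glue", "upon"}))  # ['give']
```

The point of the sketch is the generalisation step: rewrites are learned at the substring level, so a noisy word that never occurred in training can still be mapped to a dictionary entry.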

[1] Mehryar Mohri, et al. An efficient algorithm for the n-best-strings problem, 2002, INTERSPEECH.

[2] M. Crochemore, et al. On-line construction of suffix trees, 2002.

[3] David Haussler, et al. Complete inverted files for efficient text retrieval and analysis, 1987, JACM.

[4] Shourya Roy, et al. Unsupervised learning of multilingual short message service (SMS) dialect from noisy examples, 2008, AND '08.

[5] Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals, 1965.

[6] Eric Brill, et al. An Improved Error Model for Noisy Channel Spelling Correction, 2000, ACL.

[7] Alfred V. Aho, et al. Efficient string matching, 1975, Commun. ACM.

[8] Ulrich Reffle. Efficiently generating correction suggestions for garbled tokens of historical language, 2011, Nat. Lang. Eng.

[9] Nils J. Nilsson, et al. A Formal Basis for the Heuristic Determination of Minimum Cost Paths, 1968, IEEE Trans. Syst. Sci. Cybern.

[10] Klaus U. Schulz, et al. Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens, 2007, Australian Conference on Artificial Intelligence.

[11] Klaus U. Schulz, et al. Fast Selection of Small and Precise Candidate Sets from Dictionaries for Text Correction Tasks, 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[12] Kristina Toutanova, et al. Pronunciation Modeling for Improved Spelling Correction, 2002, ACL.

[13] Peter Nabende, et al. Applying dynamic Bayesian networks in transliteration detection and generation, 2011.

[14] Peter N. Yianilos, et al. Learning String-Edit Distance, 1996, IEEE Trans. Pattern Anal. Mach. Intell.

[15] Klaus U. Schulz, et al. Fast Selection of Small and Precise Candidate Sets from Dictionaries for Text Correction Tasks, 2007.

[16] Klaus U. Schulz, et al. Successfully detecting and correcting false friends using channel profiles, 2009, International Journal on Document Analysis and Recognition (IJDAR).

[17] Paul Rayson, et al. Automatic standardisation of texts containing spelling variation: How much training data do you need?, 2009.