Successfully detecting and correcting false friends using channel profiles

The detection and correction of false friends (also called real-word errors) is a notoriously difficult problem. On realistic data, the break-even point for automatic correction has so far not been reached: the infelicitous corrections introduced outnumbered the useful ones. We present a new approach in which we first compute a profile of the error channel for the given text. During correction, the profile (1) restricts attention to a small set of “suspicious” lexical tokens of the input text, namely those for which it is “plausible” to assume that the token represents a false friend. In this way, recognition of false friends is improved. The profile (2) also helps to isolate the “most promising” correction suggestion for each “suspicious” token. Using conventional word trigram statistics for disambiguation, we obtain a correction method that can be successfully applied to unrestricted text. In experiments on OCR documents, we show significant accuracy gains from fully automatic correction of false friends.
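
The two roles of the profile can be illustrated with a minimal Python sketch. It is not the authors' implementation: the profile format (a map from an intended/observed substring pair to an estimated error frequency), the toy lexicon, the trigram counts, the add-one smoothing, and the way channel and trigram scores are combined are all assumptions made for illustration, and the names `PROFILE`, `LEXICON`, `TRIGRAMS`, `candidates`, and `correct` are hypothetical.

```python
from collections import Counter

# Hypothetical channel profile: (intended, observed) -> estimated frequency
# of that substitution in the given text (e.g. OCR reading "rn" as "m").
PROFILE = {("rn", "m"): 0.05, ("cl", "d"): 0.02}

# Toy lexicon and toy word-trigram counts (assumed, not real corpus data).
LEXICON = {"a", "the", "modern", "modem", "car", "is", "fast"}
TRIGRAMS = Counter({
    ("a", "modern", "car"): 40,
    ("the", "modem", "is"): 30,
})


def candidates(token, profile, lexicon, min_prob=0.01):
    """Step (1): lexicon words the channel could plausibly have turned into
    `token`. A token with no such candidate is not treated as suspicious."""
    found = {}
    for (intended, observed), prob in profile.items():
        if prob < min_prob:
            continue
        start = token.find(observed)
        while start != -1:
            # Undo one occurrence of the profiled substitution.
            cand = token[:start] + intended + token[start + len(observed):]
            if cand in lexicon and cand != token:
                found[cand] = max(found.get(cand, 0.0), prob)
            start = token.find(observed, start + 1)
    return found


def trigram_score(tokens, i, word, trigrams):
    """Add-one smoothed count of the trigram centred on position i."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return trigrams[(left, word, right)] + 1


def correct(tokens, profile, lexicon, trigrams):
    """Step (2): for each suspicious token, keep it or replace it with the
    candidate that maximises channel probability times trigram score."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        cands = candidates(tok, profile, lexicon)
        if not cands:
            continue  # not suspicious: leave untouched
        # Assumed prior that the token was transmitted without error.
        p_ok = 1.0 - sum(cands.values())
        best, best_score = tok, p_ok * trigram_score(tokens, i, tok, trigrams)
        for cand, prob in cands.items():
            score = prob * trigram_score(tokens, i, cand, trigrams)
            if score > best_score:
                best, best_score = cand, score
        out[i] = best
    return out
```

In this sketch, `correct("a modem car is fast".split(), PROFILE, LEXICON, TRIGRAMS)` replaces "modem" with "modern", while the same token in "the modem is fast" is left alone because the trigram evidence favours the original. Tokens for which the profile yields no lexicon candidate are never touched, which is what keeps the number of infelicitous corrections small.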

[11]  Klaus U. Schulz, et al.  Successfully detecting and correcting false friends using channel profiles, 2008, AND '08.
