Successfully detecting and correcting false friends using channel profiles

The detection and correction of false friends (also called real-word errors) is a notoriously difficult problem. On realistic data, the break-even point for automatic correction has so far not been reached: the additional infelicitous corrections outnumbered the useful corrections. We present a new approach in which we first compute a profile of the error channel for the given text. During the correction process, the profile (1) helps to restrict attention to a small set of “suspicious” lexical tokens of the input text for which it is “plausible” to assume that the token represents a false friend. In this way, recognition of false friends is improved. Furthermore, the profile (2) helps to isolate the “most promising” correction suggestion for “suspicious” tokens. Using conventional word trigram statistics for disambiguation, we obtain a correction method that can be successfully applied to unrestricted text. In experiments on OCR documents, we show significant accuracy gains from fully automatic correction of false friends.
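The two roles of the profile described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the profile representation (a mapping from `(truth, observed)` substring pairs to probabilities), the threshold `min_prob`, the weight of 1.0 for keeping the original token, the add-one smoothing, and all function names are assumptions made for illustration only.

```python
from collections import Counter


def candidates(token, profile, lexicon, min_prob=0.01):
    """Lexicon words that a profiled channel error could have turned into `token`.

    `profile` maps (truth, observed) substring pairs to an estimated
    probability (hypothetical representation). Returns {candidate: prob}.
    """
    found = {}
    for (truth, observed), prob in profile.items():
        if prob < min_prob:
            continue  # error pattern too rare in this text to be plausible
        start = token.find(observed)
        while start != -1:
            # Undo the error at this position and check for a real word.
            cand = token[:start] + truth + token[start + len(observed):]
            if cand in lexicon and cand != token:
                found[cand] = max(found.get(cand, 0.0), prob)
            start = token.find(observed, start + 1)
    return found


def correct(tokens, profile, lexicon, trigrams, min_prob=0.01):
    """Replace suspicious real-word tokens when a candidate scores higher.

    `trigrams` is a Counter over (left, word, right) triples. Scores use
    add-one smoothing; the original token gets weight 1.0 (an assumption).
    """
    out = list(tokens)
    for i in range(1, len(tokens) - 1):
        tok = tokens[i]
        if tok not in lexicon:
            continue  # non-word errors are outside this sketch
        cands = candidates(tok, profile, lexicon, min_prob)
        if not cands:
            continue  # (1) token is not suspicious under the profile
        left, right = out[i - 1], tokens[i + 1]
        best, best_score = tok, trigrams[(left, tok, right)] + 1
        for cand, prob in cands.items():
            # (2) rank suggestions by channel probability times context fit
            score = prob * (trigrams[(left, cand, right)] + 1)
            if score > best_score:
                best, best_score = cand, score
        out[i] = best
    return out


# Toy data: the OCR engine is assumed to misread "rn" as "m" fairly often.
profile = {("rn", "m"): 0.05}
lexicon = {"the", "modern", "modem", "art", "museum", "is", "fast"}
trigrams = Counter({("the", "modern", "art"): 50, ("the", "modem", "is"): 30})

fixed = correct(["the", "modem", "art", "museum"], profile, lexicon, trigrams)
kept = correct(["the", "modem", "is", "fast"], profile, lexicon, trigrams)
```

In this toy setting, "modem" is flagged as suspicious in both sentences because the profile makes "modern" a plausible source word, but it is replaced only where the trigram context favors "modern" strongly enough to outweigh the low channel probability.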

Klaus U. Schulz | Ulrich Reffle | Christoph Ringlstetter | Annette Gotscharek
