Learning to Find Context Based Spelling Errors

A context-based spelling error is a spelling or typing error that turns an intended word into another word of the language. For example, the intended word “sight” might become the word “site.” A conventional spell checker cannot identify such an error, since the resulting word is itself valid. In English, the case of interest here, a syntax checker may also fail to catch such an error because, among other reasons, the part of speech of the erroneous word may permit an acceptable parse. This chapter presents a method called Ltest for identifying the majority of context-based spelling errors. Ltest learns from prior, correct text how context-based spelling errors may manifest themselves, by purposely introducing such errors and analyzing the resulting text with a data mining algorithm. The output of this learning step is a collection of logic formulas that represent knowledge about possible context-based spelling errors. When testing text is subsequently examined for context-based spelling errors, the logic formulas and a portion of the prior text are used to analyze the case at hand and to pinpoint likely errors. In tests on different text samples, Ltest found 68% of context-based spelling errors in large texts and 87% of such errors in small texts. These detection rates are relative to words for which training was possible using the prior text.
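
The idea of purposely introducing errors into correct text can be illustrated with a minimal Python sketch. The confusion sets, the function name `inject_errors`, and the labeling scheme are assumptions made for illustration and are not taken from Ltest itself; the data mining step that derives logic formulas from such examples is not shown.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical confusion sets: each word maps to other valid words
# that it could be turned into by a spelling or typing error.
CONFUSION_SETS: Dict[str, Set[str]] = {
    "sight": {"site", "cite"},
    "site": {"sight", "cite"},
    "cite": {"sight", "site"},
}

Example = Tuple[List[str], int, bool]  # (tokens, position, is_error)


def inject_errors(tokens: List[str],
                  confusion_sets: Dict[str, Set[str]]) -> List[Example]:
    """Build labeled examples from a sentence assumed to be correct.

    For every token that has a confusion set, emit the original sentence
    as a correct example and one altered copy per confusable word as an
    error example.
    """
    examples: List[Example] = []
    for i, word in enumerate(tokens):
        alternatives = confusion_sets.get(word.lower())
        if not alternatives:
            continue
        # The unmodified sentence is a correct use of the word.
        examples.append((list(tokens), i, False))
        # Each substitution simulates a context-based spelling error.
        for alt in sorted(alternatives):
            variant = list(tokens)
            variant[i] = alt
            examples.append((variant, i, True))
    return examples


examples = inject_errors("the tower was a welcome sight".split(),
                         CONFUSION_SETS)
for tokens, position, is_error in examples:
    print(" ".join(tokens), position, is_error)
```

Under these assumptions, a learner would then be trained to separate the correct examples from the altered ones using the words surrounding the marked position.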
