On Detection of Malapropisms by Multistage Collocation Testing

A malapropism is a real-word error in which one content word is unintentionally replaced by another existing content word that is similar in sound but semantically incompatible with the context, thereby destroying text cohesion, e.g., they travel around the word. We present an algorithm for malapropism detection and correction based on evaluating cohesion. As a measure of the semantic compatibility of words, we consider their ability to form syntactically linked and semantically admissible word combinations (collocations), e.g., travel (around the) world. Text cohesion at a content word is then measured as the number of collocations it forms with the words in its immediate context. We detect malapropisms as words that form no collocations in their context. To test whether two words can form a collocation, we consider two types of resources: a collocation database and an Internet search engine, e.g., Google. We illustrate the proposed method by classifying, tracing, and evaluating several English malapropisms.
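The detection rule can be illustrated with a minimal Python sketch. It is not the authors' implementation: the collocation set, the paronym table, the window size, and the function names are all hypothetical, the input is assumed to be already reduced to content words, and simple proximity stands in for the syntactic link that a real collocation test would require. Correction by substituting similar-sounding candidates is likewise only one plausible reading of the abstract.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical toy resources; a real system would query a collocation
# database or a web search engine instead.
COLLOCATIONS: Set[FrozenSet[str]] = {
    frozenset(pair)
    for pair in [("tourists", "travel"), ("travel", "world"), ("spell", "word")]
}
# Hypothetical table of similar-sounding replacement candidates.
PARONYMS: Dict[str, List[str]] = {"word": ["ward", "world"]}


def is_collocation(a: str, b: str) -> bool:
    """Return True if the unordered pair is a known collocation."""
    return frozenset((a, b)) in COLLOCATIONS


def cohesion(words: List[str], i: int, window: int = 2) -> int:
    """Count collocations that words[i] forms with neighbours in the window."""
    lo = max(0, i - window)
    hi = min(len(words), i + window + 1)
    return sum(
        1 for j in range(lo, hi) if j != i and is_collocation(words[i], words[j])
    )


def detect(words: List[str], window: int = 2) -> List[int]:
    """Indices of content words that form no collocation in their context."""
    return [i for i in range(len(words)) if cohesion(words, i, window) == 0]


def suggest(words: List[str], i: int, window: int = 2) -> List[str]:
    """Paronyms of words[i] that restore at least one collocation."""
    result = []
    for candidate in PARONYMS.get(words[i], []):
        trial = words[:i] + [candidate] + words[i + 1:]
        if cohesion(trial, i, window) > 0:
            result.append(candidate)
    return result


# Content words of the hypothetical sentence "tourists travel around the word".
content_words = ["tourists", "travel", "word"]
suspects = detect(content_words)
print(suspects, [suggest(content_words, i) for i in suspects])
```

In this toy setting only the last word is flagged, because it forms no collocation with its neighbours, and the only candidate that restores a collocation with "travel" is "world".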
