Using Contextual Spelling Correction to Improve Retrieval Effectiveness in Degraded Text Collections

This study presents the design and evaluation of an improved IR system able to cope with textual misspellings. After selecting an optimal weighting scheme for the engine, we evaluate the effect of misspellings on retrieval effectiveness. We then compare the improvement brought to the engine by the addition of two different non-interactive spelling correction strategies: a classical one, based on computing a string-to-string edit distance, and a contextual one, which adds linguistically motivated features to the string distance module. The results for the latter suggest that the loss of average precision in degraded texts can be reduced to a few percent (4%).
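As a rough illustration of the classical, non-interactive strategy, the sketch below replaces an out-of-lexicon token with its closest lexicon entry under a plain Levenshtein edit distance. It is not the system evaluated in the study: the function names, the toy lexicon, and the `max_distance` threshold are all assumptions made for the example, and the contextual, linguistically motivated features are not reproduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(token: str, lexicon: set, max_distance: int = 2) -> str:
    """Return the nearest lexicon word, or the token unchanged when it is
    already known or no candidate is within max_distance (hypothetical cutoff)."""
    if not lexicon or token in lexicon:
        return token
    # Ties on distance are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda w: (edit_distance(token, w), w))
    return best if edit_distance(token, best) <= max_distance else token


# Toy lexicon, assumed for illustration only.
lexicon = {"retrieval", "relevance", "precision"}
corrected = [correct(t, lexicon) for t in ["retreival", "precision", "xyzzy"]]
```

Because the choice depends only on string similarity, a sketch like this cannot decide between equally distant candidates; that is the gap a contextual strategy is meant to address.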
