Corrupted queries in Spanish text retrieval: error correction vs. N-Grams

In this paper, we propose and evaluate two alternatives for dealing with degraded queries in Spanish IR applications. The first is an n-gram-based strategy that does not depend on the degree of linguistic knowledge available. The second is spelling correction, for which we propose two techniques, one of which relies heavily on a stochastic model that must first be built from a POS-tagged corpus. To study their validity, a testing framework has been formally designed and applied to both approaches.
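
As a rough illustration of why a character n-gram strategy can tolerate degraded queries without any linguistic resources, the following minimal sketch splits words into overlapping character n-grams and compares a clean term with a corrupted one. The function names, the n-gram length of 4, and the Dice overlap measure are assumptions made for this example and are not taken from the paper.

```python
def char_ngrams(text, n=4):
    """Split each whitespace-delimited word into overlapping character n-grams.

    Words no longer than n are kept whole. The value n=4 is an assumption
    made for this sketch, not a setting reported in the paper.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def ngram_overlap(a, b, n=4):
    """Dice coefficient between the n-gram sets of two strings (illustrative)."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    clean = "recuperacion"
    corrupted = "recuperacoin"  # two adjacent characters transposed
    # As whole words the two strings do not match at all,
    # yet most of their character n-grams are still shared.
    print(clean == corrupted)
    print(ngram_overlap(clean, corrupted))
```

A single typing error only disturbs the n-grams that span it, so the remaining n-grams of the corrupted term can still match the index. Spelling correction, by contrast, tries to restore the original term before matching.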
