Paper information - Corrupted queries in Spanish text retrieval: error correction vs. N-Grams

In this paper, we propose and evaluate two alternatives for dealing with degraded queries in Spanish IR applications. The first is an n-gram-based strategy that does not depend on the amount of linguistic knowledge available. The second relies on spelling correction, for which we propose two techniques, one of which depends strongly on a stochastic model that must be built beforehand from a POS-tagged corpus. To study their validity, a testing framework has been formally designed and applied to both approaches.
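
As a rough illustration of why an n-gram-based strategy needs no linguistic resources, the sketch below splits words into overlapping character n-grams and shows that a misspelled query term still shares some n-grams with the intended word. It is not taken from the paper: the function names `char_ngrams` and `shared_ngrams`, the choice of n = 4, and the example words are all assumptions made for illustration.

```python
def char_ngrams(word: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole (an assumption of this sketch).
    """
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(a: str, b: str, n: int = 4) -> set[str]:
    """Return the n-grams that two words have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


if __name__ == "__main__":
    # "infromacion" is a hypothetical transposition error for "informacion".
    print(char_ngrams("tomate"))
    print(sorted(shared_ngrams("informacion", "infromacion")))
```

In this example the corrupted term keeps three of the eight 4-grams of the original word, so an index built on n-grams can still produce partial matches without a dictionary or a tagged corpus.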

Manuel Vilares Ferro | Jesús Vilares | Juan Otero Pombo
