Information retrieval in a noisy corpus (OCR)

This paper evaluates how retrieval effectiveness degrades when searching a noisy text corpus. Using a test collection available in three versions (the clean text, a version with a recognition error rate of around 5%, and a third with an error rate of 20%), we evaluated six IR models based on three text representations (bag-of-words, n-grams, trunc-n) as well as three stemmers. With the mean reciprocal rank as the performance measure, we show that retrieval effectiveness drops by around 17% on average at an error rate of 5%, and by around 46% at an error rate of 20%. The 4-gram representation tends to offer the best retrieval effectiveness on noisy text. Finally, we could not draw clear conclusions about the impact of the different stemming strategies or of blind query expansion.

KEYWORDS: Information retrieval in noisy documents (OCR), evaluation, TREC.
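As an illustration of the terms used in the abstract, the following minimal Python sketch shows one common way to build character n-gram and trunc-n representations and to compute the mean reciprocal rank. The function names, the choice to generate n-grams within word boundaries, and the handling of short words are assumptions made for this example; the abstract does not specify the exact settings used in the experiments.

```python
from typing import List, Optional


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Overlapping character n-grams of a word (assumed: short words kept whole)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def trunc_n(word: str, n: int = 4) -> str:
    """Trunc-n: keep only the first n characters of a word."""
    return word[:n]


def mean_reciprocal_rank(first_relevant_ranks: List[Optional[int]]) -> float:
    """MRR over a set of queries.

    Each entry is the 1-based rank of the first relevant document for one
    query, or None when no relevant document was retrieved (contributes 0).
    """
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / r for r in first_relevant_ranks if r)
    return total / len(first_relevant_ranks)


# A single OCR-style substitution ("l" read as "1") changes the whole word,
# but most of its 4-grams still match the clean form.
clean = set(char_ngrams("retrieval"))
noisy = set(char_ngrams("retrieva1"))
shared = clean & noisy
```

In this toy case the clean and noisy forms no longer match as whole words, yet they still share most of their 4-grams, which gives an intuition for why an n-gram representation may be more tolerant of recognition errors.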
