Paper Information - Revisiting N-Gram Based Models for Retrieval in Degraded Large Collections

Traditional retrieval models based on term matching are not effective on collections of degraded documents (for instance, the output of OCR or ASR systems). This paper presents an n-gram based distributed model for retrieval on large collections of degraded text. Evaluation was carried out on both the TREC Confusion Track and Legal Track collections, showing that the presented approach is more effective than the classical term-centred approach and than most of the participating systems in the TREC Confusion Track.

Alvaro Barreiro | Javier Parapar | Ana Freire
