Information Retrieval on Noisy Text

Spoken Document Retrieval (SDR) is the task of retrieving the segments of a speech database that are relevant to a query. The state-of-the-art approach to SDR transcribes the speech data into text and then applies standard Information Retrieval (IR) techniques. The transcription, produced by an Automatic Speech Recognition (ASR) system, contains recognition errors, which can be regarded as noise. This thesis investigates the effect of this noise on the retrieval process. We compare the results obtained with clean and noisy data at different steps of the retrieval process, using standard IR measures (precision, recall, break-even point, etc.). We show that even when error rates differ widely (10% vs 30%), performance on noisy text is only slightly lower than on clean text (a 9% relative degradation of average precision for our complete IR system, 45.2% vs 41.2%).
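As an illustration of the measures named above, the following minimal sketch computes precision, recall, the break-even point, and average precision for a single ranked result list, and checks the relative degradation quoted in the abstract. It uses the common textbook definitions, which are not necessarily the exact evaluation procedure of the thesis. The function names, document identifiers, and relevance set are hypothetical.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall over the top-k entries of a ranked list."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k, hits / len(relevant)


def break_even_point(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents.

    At that cutoff precision and recall are equal by construction.
    """
    precision, _ = precision_recall(ranked, relevant, len(relevant))
    return precision


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def relative_degradation(clean, noisy):
    """Relative drop of a score when moving from clean to noisy text."""
    return (clean - noisy) / clean


# Hypothetical toy data: a ranked list and its set of relevant documents.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

p_at_3, r_at_3 = precision_recall(ranked, relevant, 3)
bep = break_even_point(ranked, relevant)
ap = average_precision(ranked, relevant)

# Figures from the abstract: 45.2% vs 41.2% average precision.
drop = relative_degradation(45.2, 41.2)
print(f"P@3={p_at_3:.3f} R@3={r_at_3:.3f} BEP={bep:.3f} AP={ap:.3f}")
print(f"relative degradation: {drop:.1%}")
```

The last line shows that the 4-point absolute difference between 45.2% and 41.2% corresponds to a relative degradation of roughly 9%.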
