Effect of Recognition Errors on Text Clustering

This paper presents clustering experiments performed over noisy texts, i.e. texts extracted through an automatic process such as character or speech recognition. The effect of recognition errors is investigated by comparing the clusterings obtained over clean (manually typed) and noisy (automatic speech transcription) versions of the same speech recording corpus.
