Towards effective retrieval of spontaneous conversational spoken content

The continuing development of technologies for recording and storing multimedia content means that the volume of archived digital material is growing rapidly. While some of this material is formally structured and edited, an increasing amount is user generated and informal. We report an extensive investigation into the effectiveness of speech search for challenging, informally structured spoken content archives, and the development of methods that address the identified challenges. We explore the relationship between automatic speech recognition (ASR) accuracy, automated segmentation of informal content into semantically focused retrieval units, and retrieval behaviour. We introduce new evaluation metrics designed to assess retrieval results according to different aspects of the user experience. Our studies concentrate on three types of data containing natural conversation: lectures, meetings and Internet TV. Our experiments provide a deep understanding of the challenges and issues involved in spoken content retrieval (SCR). For all three types of data, effective segmentation of the spoken content is shown to significantly improve search effectiveness. SCR output consists of audio or video files, even when the system is based on their textual representation. Such result lists are difficult to browse, since the user must listen to the audio content or watch the video segments. It is therefore important that playback begin as close as possible to the start of the relevant content (the jump-in point) within a segment. Based on our analysis of the factors underlying retrieval success and failure, we report a study of methods to improve retrieval effectiveness from the perspectives of content ranking and access to relevant content within retrieved materials.
The methods explored in this thesis include alternative segmentation strategies, content expansion based on internal and external information sources, and exploitation of the acoustic information corresponding to the ASR transcripts.
