Exploring speech retrieval from meetings using the AMI corpus

Abstract Increasing amounts of informal spoken content are being collected, e.g. recordings of meetings, lectures and personal data sources. The growing volume of this content and the difficulty of manually searching audio data mean that efficient automated search tools are of increasing importance if its full potential is to be realized. Much existing work on speech search has focused on retrieval of clearly defined document units in ad hoc search tasks. We investigate search of informal speech content using an extended version of the AMI meeting collection. A retrieval collection was constructed by augmenting the AMI corpus with a set of ad hoc search requests and manually identified relevant regions of the recorded meetings. Unlike standard ad hoc information retrieval, which focuses primarily on precision, we assume a recall-focused search scenario in which a user seeks to retrieve a particular incident occurring within meetings relevant to the query. We explore the relationship between automatic speech recognition (ASR) accuracy, automated segmentation of the meeting into retrieval units, and retrieval behaviour with respect to both precision and recall. Experimental retrieval results show that, while averaged retrieval effectiveness in terms of precision is generally comparable for automatically extracted segments from manual transcripts and from ASR transcripts with high recognition accuracy, segments with poor recognition quality become very hard to retrieve and may fall below the rank position to which a user is willing to search. These changes impact system effectiveness for recall-focused search tasks. Varied ASR quality across the relevant and non-relevant data means that the rank of some well-recognized relevant segments is actually promoted for ASR transcripts compared to manual ones. This effect is not revealed by the averaged precision-based retrieval evaluation metrics typically used for evaluation of speech retrieval.
However, such variations in the ranks of relevant segments can considerably affect the user's experience in terms of the order in which retrieved content is presented. Analysis of our results reveals that while longer relevant segments are generally more robust to ASR errors, and consequently retrieved at higher ranks, this is often at the expense of the user needing to engage in longer content playback to locate the relevant content in the audio recording. Our overall conclusion is that it is desirable to minimize the length of retrieval units containing relevant content while seeking to maintain high ranking of these items.
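The contrast drawn above between averaged precision-based metrics and a recall-focused view of rank positions can be illustrated with a toy computation. The sketch below uses entirely hypothetical segment rankings (segment names, rank cutoff, and scores are illustrative assumptions, not data from the paper): a single poorly recognized relevant segment drops below the rank cutoff a user is willing to scan, halving recall at that cutoff, while average precision degrades only moderately.

```python
# Toy illustration (hypothetical data): how an ASR-induced demotion of one
# relevant segment below a user's rank cutoff hurts recall-focused search,
# while averaged precision changes only modestly.

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved segments that are relevant."""
    return sum(1 for seg in ranking[:k] if seg in relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant segments found within the top k."""
    return sum(1 for seg in ranking[:k] if seg in relevant) / len(relevant)

def average_precision(ranking, relevant):
    """Mean of precision values at each rank where a relevant segment appears."""
    hits, total = 0, 0.0
    for rank, seg in enumerate(ranking, start=1):
        if seg in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

relevant = {"seg_a", "seg_b"}

# Ranking over manual (reference) transcripts: both relevant segments near the top.
manual = ["seg_a", "x1", "seg_b", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]

# Ranking over errorful ASR transcripts: the poorly recognized relevant
# segment (seg_b) falls below a rank-5 cutoff the user is willing to scan.
asr = ["seg_a", "x1", "x2", "x3", "x4", "x5", "x6", "seg_b", "x7", "x8"]

cutoff = 5
print(average_precision(manual, relevant))    # 0.8333...
print(average_precision(asr, relevant))       # 0.625
print(recall_at_k(manual, relevant, cutoff))  # 1.0 -> both incidents reachable
print(recall_at_k(asr, relevant, cutoff))     # 0.5 -> one incident effectively lost
```

Averaged over many queries, the drop from 0.83 to 0.63 in average precision can look like graceful degradation, yet for the recall-focused user scanning only the top 5 results, one of the two relevant incidents has become unreachable.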
