Automatic recognition of spontaneous speech for access to multilingual oral history archives

Much is known about the design of automated systems to search broadcast news, but it has only recently become possible to apply similar techniques to large collections of spontaneous speech. This paper presents initial results from experiments with speech recognition, topic segmentation, topic categorization, and named entity detection using a large collection of recorded oral histories. The work leverages a massive manual annotation effort on 10 000 h of spontaneous speech to evaluate the degree to which automatic speech recognition (ASR)-based segmentation and categorization techniques can be adapted to approximate decisions made by human annotators. ASR word error rates near 40% were achieved for both English and Czech on heavily accented, emotional spontaneous speech from elderly speakers, based on 65-84 h of transcribed speech. Topical segmentation based on shifts in the recognized English vocabulary resulted in 80% agreement with manually annotated boundary positions at a 0.35 false alarm rate. Categorization was considerably more challenging, with a nearest-neighbor technique yielding F=0.3. This is less than half the value obtained by the same technique on a standard newswire categorization benchmark, but replication on human-transcribed interviews showed that ASR errors explain little of that difference. The paper concludes with a description of how these capabilities could be used together to search large collections of recorded oral histories.
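For readers unfamiliar with the word error rate figures quoted above, the following is a minimal sketch of the conventional definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the scoring tool or text normalization actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and not the normalization used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a five-word reference.
    print(word_error_rate("a b c d e", "a x c e"))
```

Under this definition, a rate near 40% means that roughly two errors occur for every five reference words. Because insertions are counted, the value can exceed 1.0 for very poor output.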
