DiSCo - A speaker and speech recognition evaluation corpus for challenging problems in the broadcast domain

Systems for speech and speaker recognition already achieve low error rates when applied to high-quality audiovisual broadcast data, such as news shows recorded in a studio environment, and several evaluation corpora exist for this domain in various languages. In actual broadcast-analysis applications, however, the data requirements are more complex: the material goes well beyond the planned speech of a news anchor. Live speeches by prominent politicians, for example, are often recorded in environments with challenging acoustic properties, and discussions typically feature highly spontaneous speech with several speakers talking at the same time. The performance of standard approaches to speech and speaker recognition typically deteriorates under such conditions, and dedicated techniques have to be developed to handle them. Evaluation corpora are therefore needed which reflect the challenging conditions of these applications. Currently, no German evaluation corpus covers the required acoustic conditions and diverse language properties. This contribution describes the design of a new speaker and speech recognition evaluation corpus for the broadcast domain that reflects the typical problems encountered in actual applications.
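As a minimal illustration of how such a condition-stratified evaluation corpus might be consumed, the following Python sketch computes a word error rate separately for each acoustic/speech condition. The manifest file name, its tab-separated layout, and the condition labels are assumptions made for this example and are not part of the corpus specification described in the paper.

```python
# Sketch: condition-wise WER over a broadcast evaluation corpus.
# Assumed (hypothetical) manifest "disco_manifest.tsv" with tab-separated columns:
#   condition <TAB> reference transcript <TAB> hypothesis transcript
# where "condition" names a subset such as "planned-clean" or "spontaneous-noisy".

import csv
from collections import defaultdict


def word_edits(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein edit distance between reference and hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]


def wer_per_condition(manifest_path: str) -> dict:
    """Aggregate word errors and reference lengths separately per condition."""
    errors = defaultdict(int)
    words = defaultdict(int)
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for condition, reference, hypothesis in csv.reader(f, delimiter="\t"):
            errors[condition] += word_edits(reference, hypothesis)
            words[condition] += len(reference.split())
    return {c: errors[c] / max(words[c], 1) for c in errors}


if __name__ == "__main__":
    for condition, wer in sorted(wer_per_condition("disco_manifest.tsv").items()):
        print(f"{condition:20s} WER = {wer:.2%}")
```

Reporting error rates per condition in this way is what makes the degradation on spontaneous, noisy, or overlapping speech visible, rather than hiding it in a single corpus-wide average.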
