The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments

We present an overview of the data collection and transcription efforts for the COnversational Speech In Noisy Environments (COSINE) corpus. The corpus is a set of multi-party conversations recorded in real-world environments with background noise; it can be used to train noise-robust speech recognition systems or to develop speech denoising algorithms. We explain the motivation for creating such a corpus and describe the audio recordings and transcriptions that comprise it. The high-quality recordings were captured in situ on a custom wearable recording system, whose design and construction are also described. Seven channels of synchronized audio are captured on separate channels: four from a far-field microphone array, plus a close-talking microphone, a monophonic far-field microphone, and a throat microphone. This corpus thus creates many possibilities for speech algorithm research.
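To make the channel layout concrete, here is a minimal sketch of how a consumer of such a corpus might split a synchronized 7-channel recording into the microphone groups named above. The channel ordering and the function name `split_channels` are illustrative assumptions, not the corpus's documented format.

```python
import numpy as np

def split_channels(audio):
    """Split a (n_samples, 7) array of synchronized samples into mic groups.

    Assumed ordering (hypothetical, for illustration only):
    channels 0-3 = far-field array, 4 = close-talking,
    5 = monophonic far-field, 6 = throat microphone.
    """
    return {
        "far_field_array": audio[:, 0:4],  # 4-channel far-field mic array
        "close_talking":   audio[:, 4:5],  # close-talking microphone
        "far_field_mono":  audio[:, 5:6],  # monophonic far-field microphone
        "throat":          audio[:, 6:7],  # throat microphone
    }

# Example: one second of silence at a 16 kHz sampling rate
audio = np.zeros((16000, 7))
groups = split_channels(audio)
```

Keeping each group as a 2-D slice (rather than squeezing single-mic channels to 1-D) lets downstream array-processing code treat all groups uniformly.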
