A multi-channel corpus for distant-speech interaction in the presence of known interferences

This paper describes a new corpus of multi-channel audio data designed to support the study and development of distant-speech recognition systems that can cope with known interfering sounds propagating in the environment. The corpus consists of both real and simulated signals, together with a corresponding detailed annotation. An extensive set of speech recognition experiments was conducted using three different Acoustic Echo Cancellation (AEC) techniques to establish baseline results for future reference. The AEC techniques were applied both to single distant-microphone signals and to beamformed signals generated with two state-of-the-art beamforming techniques. We show that speech recognition performance with the different techniques is comparable on the simulated and real data, demonstrating the usefulness of this corpus for speech research. We also show that combining state-of-the-art AEC and beamforming techniques yields a significant improvement in speech recognition performance over using a single distant-microphone input.
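
To make the processing pipeline behind the baselines concrete, the sketch below shows a minimal time-domain NLMS echo canceller and a basic delay-and-sum beamformer in Python. This is not the implementation used in the paper: the function names, filter length, step size and integer-sample steering delays are illustrative assumptions, and the paper's baselines rely on more sophisticated AEC variants and two state-of-the-art beamformers not specified here.

```python
import numpy as np

def nlms_echo_canceller(mic, ref, n_taps=1024, mu=0.5, eps=1e-8):
    """Illustrative sketch: remove a known interfering signal (ref) from a
    distant-microphone channel (mic) with a time-domain NLMS adaptive filter.
    Returns the residual (echo-cancelled) signal."""
    w = np.zeros(n_taps)                     # estimated echo path
    x = np.zeros(n_taps)                     # buffer of recent reference samples
    out = np.zeros_like(mic, dtype=float)
    for n in range(len(mic)):
        x = np.roll(x, 1)                    # shift buffer: x[0] is the newest sample
        x[0] = ref[n]
        e = mic[n] - w @ x                   # subtract the predicted echo
        w += (mu / (x @ x + eps)) * e * x    # normalised LMS tap update
        out[n] = e
    return out

def delay_and_sum(channels, delays_samples):
    """Illustrative sketch: delay-and-sum beamformer with integer-sample
    steering delays and equal weights (wrap-around from np.roll is ignored
    for simplicity)."""
    n = min(len(c) for c in channels)
    aligned = [np.roll(np.asarray(c[:n], dtype=float), -d)
               for c, d in zip(channels, delays_samples)]
    return np.mean(aligned, axis=0)
```

In the spirit of the baseline configurations, such an echo canceller could be run either on each distant-microphone channel individually or on the output of the beamformer; the experiments reported in the paper compare exactly these kinds of single-channel and combined AEC-plus-beamforming set-ups, albeit with stronger algorithms.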
