A speech event detection and localization task for multiroom environments

Domestic environments are particularly challenging for distant speech recognition and for audio processing in general. Reverberation, background noise, and interfering sources, as well as the propagation of acoustic events across adjacent rooms, severely degrade the performance of standard speech processing algorithms. The DIRHA EU project addresses the development of distant-speech interaction with devices and services across the multiple rooms of a typical apartment. A corpus of multichannel acoustic data has been created to represent realistic acoustic scenes of varying complexity occurring in such an environment. It includes multichannel simulations based on measured impulse responses, as well as real data collected in the same apartment. A fundamental task of the front-end processing that enables effective automatic speech recognition (ASR) is the detection and localization of speech events generated by users, with no constraints on their position or orientation within the various rooms. In this paper we describe the acoustic corpus and present a baseline approach to the joint task of speech detection and source localization, using speech-related features such as pitch combined with features derived from spatial coherence.
