On the selection of the impulse responses for distant-speech recognition based on contaminated speech training

Distant-speech recognition is a technology of fundamental importance for the future development of assistive applications characterized by flexible and unobtrusive interaction in home environments. State-of-the-art speech recognition still exhibits a lack of robustness and an unacceptable performance variability due to environmental noise, reverberation effects, and speaker position. In the past, multi-condition training and contamination methods were explored to reduce the mismatch between training and test conditions. However, the resulting performance evaluations can be biased by factors such as the limited number of speaker and microphone positions, the adopted set of impulse responses, and the vocabulary and grammars defining the recognition task. The purpose of this paper is to investigate in more detail some critical aspects that characterize such an experimental context. To this end, our work addressed a microphone network distributed over different rooms of an apartment, with a set of speaker-microphone pairs leading to a very large set of impulse responses. Besides simulations, the experiments also tackled real speech interactions. The performance evaluation was based on a phone-loop task, in order to minimize the influence of linguistic constraints. The experimental results show that an accurate selection of impulse responses is less critical than other factors, such as the signal-to-noise ratio introduced by additive background noise.
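
As a minimal sketch of the contamination step discussed above, the following Python fragment convolves clean speech with a measured room impulse response and adds background noise at a target SNR; it assumes NumPy/SciPy, a noise recording at least as long as the speech segment, and an illustrative function name and signature that are not taken from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def contaminate(clean, rir, noise, snr_db):
    """Simulate distant-talking speech: convolve clean speech with a room
    impulse response (rir) and add background noise at snr_db dB SNR."""
    # Reverberate the clean signal with the impulse response.
    reverberant = fftconvolve(clean, rir)[: len(clean)]

    # Scale the noise so that the resulting SNR matches snr_db.
    noise = noise[: len(reverberant)]
    sig_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))

    return reverberant + gain * noise
```

In a contaminated-training setup of this kind, the same routine would be applied to every clean training utterance, drawing impulse responses and noise segments that match the intended speaker-microphone configurations.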
