Contaminated speech training methods for robust DNN-HMM distant speech recognition

Despite the significant progress made in recent years, state-of-the-art speech recognition technologies provide satisfactory performance only under close-talking conditions. Robustness of distant speech recognition in adverse acoustic conditions, on the other hand, remains a crucial open issue for future applications of human-machine interaction. In this context, several recent advances in speech enhancement, acoustic scene analysis, and acoustic modeling have contributed to improving the state of the art in the field. One of the most effective approaches for deriving robust acoustic models is training on contaminated speech, which has proved helpful in reducing the acoustic mismatch between training and testing conditions. In this paper, we revisit this classical approach in the context of modern DNN-HMM systems and propose the adoption of three methods, namely asymmetric context windowing, close-talk based supervision, and close-talk based pre-training. The experimental results, obtained using both real and simulated data, show a significant advantage in using these three methods, which overall provide a 15% error rate reduction compared to the baseline systems. The same performance trend is confirmed both with a small, high-quality training set and with a large one.
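To make two of the ideas mentioned above concrete, the following is a minimal sketch, not taken from the paper itself: contaminated training data is simulated by convolving close-talking waveforms with measured room impulse responses, and the DNN input is built from an asymmetric context window that uses more past frames than future frames. The function names, the impulse-response source, and the specific context sizes are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def contaminate(clean_wave, room_impulse_response):
    """Simulate distant-talking speech by convolving a close-talking
    waveform with a measured room impulse response (illustrative sketch)."""
    reverberant = fftconvolve(clean_wave, room_impulse_response)
    return reverberant[: len(clean_wave)]  # keep original length and alignment

def asymmetric_context(features, n_past=9, n_future=2):
    """Build DNN input vectors from an asymmetric context window with more
    past than future frames (the sizes here are assumptions, not the paper's).
    features: (num_frames, feat_dim) array of, e.g., log-mel features."""
    num_frames, feat_dim = features.shape
    # pad by repeating the edge frames so every frame has a full context
    padded = np.pad(features, ((n_past, n_future), (0, 0)), mode="edge")
    windows = [
        padded[t : t + n_past + n_future + 1].reshape(-1)
        for t in range(num_frames)
    ]
    return np.stack(windows)  # shape: (num_frames, (n_past + n_future + 1) * feat_dim)
```

In a contaminated-training setup of this kind, the frame-level targets (HMM state labels) would typically come from an alignment computed on the corresponding close-talking signal, which is the intuition behind the close-talk based supervision and pre-training mentioned above.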
