Efficient training of acoustic models for reverberation-robust medium-vocabulary automatic speech recognition

A recently proposed concept for training reverberation-robust acoustic models for automatic speech recognition using pairs of clean and reverberant data is extended from word models to tied-state triphone models in this paper. The key idea of the concept, termed ICEWIND, is to use the clean data for the temporal alignment and the reverberant data for the estimation of the emission densities. Experiments with the 5000-word Wall Street Journal corpus confirm the benefits of ICEWIND with tied-state triphones: the training time is reduced by more than 90% while the word accuracy is simultaneously improved, both for room-specific and multi-style hidden Markov models. Since acoustic models trained with ICEWIND need fewer Gaussian components for the emission densities to achieve recognition rates comparable to those of Baum-Welch-trained acoustic models, ICEWIND also allows for reduced decoding complexity.
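The following is a minimal sketch of the key idea described above, under several simplifying assumptions: the `viterbi_align` routine and the clean-trained HMM it operates on are hypothetical stand-ins for the paper's forced alignment with tied-state triphone models, the clean and reverberant feature sequences are assumed to be frame-synchronous stereo pairs, and a single diagonal Gaussian per state is estimated here instead of the full GMM emission densities used in the actual system.

```python
# Sketch of the ICEWIND principle: state-level alignments come from the clean
# features, while the emission statistics are accumulated from the
# time-aligned reverberant features.

import numpy as np

def accumulate_state_stats(alignments, reverberant_feats, num_states, dim):
    """Accumulate per-state sufficient statistics from reverberant frames
    that are assigned according to the clean-data alignment."""
    counts = np.zeros(num_states)
    sums = np.zeros((num_states, dim))
    sq_sums = np.zeros((num_states, dim))
    for states, feats in zip(alignments, reverberant_feats):
        # states[t] is the HMM state assigned to frame t of the *clean*
        # utterance; the reverberant features are assumed frame-synchronous.
        for t, s in enumerate(states):
            counts[s] += 1
            sums[s] += feats[t]
            sq_sums[s] += feats[t] ** 2
    return counts, sums, sq_sums

def estimate_diag_gaussians(counts, sums, sq_sums, var_floor=1e-3):
    """Turn the accumulated statistics into per-state diagonal Gaussians
    (a stand-in for re-estimating the GMM emission densities)."""
    means = sums / np.maximum(counts[:, None], 1.0)
    variances = sq_sums / np.maximum(counts[:, None], 1.0) - means ** 2
    return means, np.maximum(variances, var_floor)

# Usage sketch (viterbi_align and clean_hmm are hypothetical):
#   alignments = [viterbi_align(clean_hmm, x_clean) for x_clean in clean_feats]
#   counts, sums, sq = accumulate_state_stats(alignments, reverb_feats,
#                                             num_states, dim)
#   means, variances = estimate_diag_gaussians(counts, sums, sq)
```

Because the alignment is computed only once on the clean data rather than iteratively re-estimated on the reverberant data, the expensive Baum-Welch passes over the reverberant material are avoided, which is where the reported training-time savings come from.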
