Boosted acoustic model learning and hypotheses rescoring on the CHiME-3 task

Speech recognition in a realistic noisy environment using multiple microphones is the focal point of the third CHiME challenge. Over the baseline ASR system provided for this challenge, we apply state of the art algorithms for boosting acoustic model learning and hypothesis rescoring to improve the final output. To this aim, we first use the automatic transcription of each channel to re-train the acoustic model for that channel and then we apply linear language model rescoring to find a better solution in the n-best list. LM rescoring is performed using an efficient set of N-gram and Recurrent Neural Network LM (RNNLM) trained on a wisely-selected text set. In the experiments, we show that the proposed approach improves not only the individual channel transcription, but also the enhanced channels produced by MVDR and delay-and-sum beamforming.

[1]  X. Mestre,et al.  On diagonal loading for minimum variance beamformers , 2003, Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (IEEE Cat. No.03EX795).

[2]  Jonathan G. Fiscus,et al.  REDUCED WORD ERROR RATES , 1997 .

[3]  Michael S. Brandstein,et al.  Microphone Arrays - Signal Processing Techniques and Applications , 2001, Microphone Arrays.

[4]  Kaisheng Yao,et al.  KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Jan Cernocký,et al.  Improved feature processing for deep neural networks , 2013, INTERSPEECH.

[6]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[7]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[8]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  José Guilherme Camargo de Souza,et al.  Quality Estimation for Automatic Speech Recognition , 2014, COLING.

[10]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[11]  Jon Barker,et al.  The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[12]  Biing-Hwang Juang,et al.  Recurrent deep neural networks for robust speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Abhinav Sethy,et al.  Resampling auxiliary data for language model adaptation in machine translation for speech , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[15]  Dong Yu,et al.  Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[16]  Kaisheng Yao,et al.  Adaptation of context-dependent deep neural networks for automatic speech recognition , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[17]  Mauro Cettolo,et al.  IRSTLM: an open source toolkit for handling large scale language models , 2008, INTERSPEECH.

[18]  Hermann Ney,et al.  Improved backing-off for M-gram language modeling , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[19]  Geoffrey Zweig,et al.  Cache based recurrent neural network language model inference for first pass speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Lukás Burget,et al.  Empirical Evaluation and Combination of Advanced Language Modeling Techniques , 2011, INTERSPEECH.

[21]  Daniele Falavigna,et al.  Driving ROVER with Segment-based ASR Quality Estimation , 2015, ACL.

[22]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[23]  Lukás Burget,et al.  Sequence-discriminative training of deep neural networks , 2013, INTERSPEECH.

[24]  Tomas Mikolov,et al.  RNNLM - Recurrent Neural Network Language Modeling Toolkit , 2011 .

[25]  Xiao Li,et al.  Regularized Adaptation of Discriminative Classifiers , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.