Modeling long temporal contexts for robust DNN-based speech recognition

Deep Neural Networks (DNNs) have been shown to outperform traditional Gaussian Mixture Models on many Automatic Speech Recognition (ASR) tasks. In this work, we investigate the potential of modeling long temporal acoustic contexts with DNNs. The complete temporal context is split into several sub-contexts. Multiple sub-context DNNs, initialized with the same set of Restricted Boltzmann Machines, are fine-tuned independently, and their last-hidden-layer activations are combined to jointly predict the desired state posteriors through a single softmax output layer. In preliminary experiments on the Aurora2 multi-style training task, the proposed system models a 65-frame temporal window of the speech signal and yields a 4.4% word error rate (WER), a 12.0% relative improvement over the best single DNN. Under the local independence assumption, both training and testing of the sub-context DNNs can be parallelized. Moreover, the system uses 48.2% fewer parameters than a single DNN with the same number of hidden units.
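To make the combination scheme concrete, the following is a minimal PyTorch sketch of the joint model, assuming concatenation as the way the last-hidden-layer activations are combined. All names and dimensions here (MultiSubContextModel, five 13-frame sub-contexts over a 65-frame window, 39-dimensional features, 1024 hidden units, 1000 tied states) are illustrative assumptions rather than values from the paper, and the RBM pre-training and independent fine-tuning stages are omitted.

```python
import torch
import torch.nn as nn


class SubContextDNN(nn.Module):
    """One DNN over a sub-window of the full temporal context.

    Returns the last hidden layer's activations rather than posteriors;
    layer sizes are illustrative, not taken from the paper.
    """

    def __init__(self, in_dim, hidden_dim=1024, n_hidden=4):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(dim, hidden_dim), nn.Sigmoid()]
            dim = hidden_dim
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


class MultiSubContextModel(nn.Module):
    """Concatenates the last hidden layers of several sub-context DNNs
    and maps them to state posteriors via a single softmax output layer
    (log-softmax here, for use with a negative log-likelihood loss)."""

    def __init__(self, n_sub, frames_per_sub, feat_dim, n_states, hidden_dim=1024):
        super().__init__()
        self.subnets = nn.ModuleList(
            SubContextDNN(frames_per_sub * feat_dim, hidden_dim) for _ in range(n_sub)
        )
        self.output = nn.Linear(n_sub * hidden_dim, n_states)

    def forward(self, sub_inputs):
        # sub_inputs: one (batch, frames_per_sub * feat_dim) tensor per
        # sub-context window; each sub-network runs independently.
        h = torch.cat([net(x) for net, x in zip(self.subnets, sub_inputs)], dim=-1)
        return torch.log_softmax(self.output(h), dim=-1)


# Hypothetical usage: a 65-frame window split into five 13-frame sub-contexts,
# 39-dimensional features, 1000 tied HMM states, a batch of 8 center frames.
model = MultiSubContextModel(n_sub=5, frames_per_sub=13, feat_dim=39, n_states=1000)
windows = [torch.randn(8, 13 * 39) for _ in range(5)]
log_posteriors = model(windows)  # shape: (8, 1000)
```

Because each sub-network sees only its own sub-window, the forward passes in the combination step are mutually independent, which is what allows the sub-context DNNs to be trained and evaluated in parallel as claimed above.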
