On the importance of modeling and robustness for deep neural network features

A large body of research has shown that acoustic features for speech recognition can be learned from data using neural networks with multiple hidden layers (DNNs), and that these learned features outperform standard features such as MFCCs. However, this superiority is usually demonstrated when the data used to learn the features is very similar in character to the data used to test recognition performance. An open question is how well these learned features generalize to realistic data that differs in character from their training data. The ability of a feature representation to generalize to unfamiliar data is a highly desirable form of robustness. In this paper we investigate the robustness of two DNN-based feature sets to training/test mismatch using the ICSI meeting corpus. The experiments were performed under three training/test scenarios: (1) matched near-field, (2) matched far-field, and (3) mismatched, with near-field training and far-field testing. The experiments leverage simulation and a novel sampling process that we have developed for diagnostic analysis within the HMM-based speech recognition framework. First, diagnostic analysis shows that a DNN-based feature representation with MFCC inputs (MFCC-DNN) is indeed superior to the corresponding MFCC baselines in the two matched scenarios, where recognition errors stem from incorrect modeling, but the DNN-based features and MFCCs perform nearly identically, and poorly, in the mismatched scenario. Second, we show that a DNN-based feature representation with a more robust input, namely the power-normalized spectrum (PNS) processed by Gabor filters, performs nearly as well as the MFCC-DNN features in the matched scenarios and much better than both MFCCs and MFCC-DNNs in the mismatched scenario.
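As a rough illustration of the two front ends the abstract contrasts, the sketch below computes a power spectrogram, applies a power-law nonlinearity in place of log compression (PNS-style), filters it with a single 2-D spectro-temporal Gabor filter, and passes the result through a small feed-forward network whose hidden activations stand in for learned features. This is a minimal sketch under stated assumptions, not the paper's pipeline: the power-law exponent, filter shape, and layer sizes are illustrative, and the network here is untrained, whereas the paper's DNN features come from networks trained on labeled speech frames.

```python
# Minimal sketch of a PNS + Gabor front end feeding a small "DNN" feature
# extractor. All parameters are illustrative assumptions, not values from
# the paper, and the MLP weights are random (untrained) just to show shapes.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

def power_spectrogram(x, n_fft=512, hop=160):
    """Frame a waveform, window it, and return a power spectrogram (freq x time)."""
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    frames = frames * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, axis=1)).T ** 2

def power_normalize(spec, exponent=1.0 / 15.0):
    """Power-law nonlinearity (as in PNCC-style front ends) instead of log."""
    return spec ** exponent

def gabor_filter(n_f=9, n_t=9, omega_f=0.25, omega_t=0.25):
    """One 2-D spectro-temporal Gabor filter: complex carrier times Hann envelope."""
    f = np.arange(n_f) - n_f // 2
    t = np.arange(n_t) - n_t // 2
    env = np.outer(np.hanning(n_f), np.hanning(n_t))
    carrier = np.exp(1j * (omega_f * f[:, None] + omega_t * t[None, :]))
    return env * carrier

def dnn_features(frames, sizes=(512, 512, 40)):
    """Forward pass of an untrained MLP; the final hidden activations stand in
    for learned features (a real tandem DNN would be trained on labeled frames)."""
    h = frames
    for n_out in sizes:
        w = rng.standard_normal((h.shape[1], n_out)) * 0.01
        h = np.tanh(h @ w)
    return h

x = rng.standard_normal(16000)             # 1 s of synthetic audio at 16 kHz
spec = power_spectrogram(x)
pns = power_normalize(spec)                # robust alternative to log compression
gabor_out = np.abs(convolve2d(pns, gabor_filter(), mode="same"))
feats = dnn_features(gabor_out.T)          # (time x 40) feature matrix
print(feats.shape)
```

In a tandem-style setup, feature vectors like these would be appended to, or substituted for, the spectral features consumed by the HMM-based recognizer; swapping the PNS/Gabor stage for plain MFCC frames yields the MFCC-DNN variant the abstract compares against.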
