Robust CNN-based speech recognition with Gabor filter kernels

As has been extensively shown, acoustic features for speech recognition can be learned by neural networks with multiple hidden layers. However, the learned transformations may not generalize sufficiently to test sets that are significantly mismatched to the training data. Gabor features, on the other hand, are generated from spectro-temporal filters designed to model human auditory processing. In previous work, these features were used as inputs to neural networks, which improved word accuracy for speech recognition in the presence of noise. Here we propose a neural network architecture called a Gabor Convolutional Neural Network (GCNN) that incorporates Gabor functions into convolutional filter kernels. In this architecture, a variety of Gabor features serve as the multiple feature maps of the convolutional layer, and the filter coefficients are further tuned by back-propagation training. Experiments used two noisy versions of the WSJ corpus: Aurora 4 and RATS re-noised WSJ. In both cases, the proposed architecture performed better than the other noise-robust features we tried, namely ETSI-AFE, PNCC, Gabor features without the CNN-based approach, and our best neural network features that do not incorporate Gabor functions.
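
To make the kernel-initialization idea concrete, the following is a minimal sketch (in Python with PyTorch, not the authors' implementation) of seeding a convolutional layer's feature maps with 2-D spectro-temporal Gabor functions and leaving the coefficients trainable so back-propagation can tune them. The kernel size, modulation frequencies, and envelope width below are illustrative assumptions, not values from the paper.

    import numpy as np
    import torch
    import torch.nn as nn

    def gabor_kernel(size, omega_t, omega_f, sigma):
        """2-D Gabor filter: a cosine carrier windowed by a Gaussian envelope.
        omega_t / omega_f are temporal / spectral modulation frequencies
        (radians per sample); returns a size x size real-valued kernel."""
        half = size // 2
        t, f = np.meshgrid(np.arange(-half, half + 1),
                           np.arange(-half, half + 1), indexing="ij")
        envelope = np.exp(-(t ** 2 + f ** 2) / (2.0 * sigma ** 2))
        carrier = np.cos(omega_t * t + omega_f * f)
        return (envelope * carrier).astype(np.float32)

    # One feature map per (temporal, spectral) modulation-frequency pair;
    # the specific frequencies here are assumed for illustration.
    mod_freqs = [0.25, 0.5, 1.0]
    kernels = np.stack([gabor_kernel(9, wt, wf, sigma=2.5)
                        for wt in mod_freqs for wf in mod_freqs])

    # Convolutional layer over a single-channel spectro-temporal input
    # (e.g., a log mel spectrogram patch), one output channel per Gabor kernel.
    conv = nn.Conv2d(in_channels=1, out_channels=len(kernels),
                     kernel_size=9, padding=4, bias=False)
    with torch.no_grad():
        conv.weight.copy_(torch.from_numpy(kernels).unsqueeze(1))
    # conv.weight.requires_grad remains True, so training updates the
    # Gabor-initialized coefficients rather than keeping them fixed.

In this sketch the Gabor functions act only as an initialization; the filters start at auditory-motivated spectro-temporal patterns and are then refined jointly with the rest of the network, rather than being fixed front-end features.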
