A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion

We develop and present a novel deep convolutional neural network architecture in which heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing the speech-class confusion induced by such invariance. The design of the pooling layer is guided by domain knowledge about how speech classes change when formant frequencies are modified. The convolution and heterogeneous-pooling layers are followed by a fully connected multi-layer neural network, forming a deep architecture that is interfaced to an HMM for continuous speech recognition. During training, all layers of the deep network are regularized using a variant of the "dropout" technique. Experimental evaluation demonstrates the effectiveness of both heterogeneous pooling and dropout regularization. On the TIMIT phonetic recognition task, we achieve an 18.7% phone error rate, the lowest reported in the literature on this standard task for a single system using no speaker-identity information. Preliminary experiments on large-vocabulary speech recognition in a voice search task also show error-rate reductions from heterogeneous pooling in the deep convolutional neural network.
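To make the architecture concrete, the sketch below illustrates one plausible reading of heterogeneous pooling: the convolutional feature maps are partitioned into groups, and each group is max-pooled along the frequency axis with a different pooling size, so that groups with larger pools trade more frequency-shift invariance against more potential class confusion. The framework (PyTorch), feature dimensions, number of maps, pooling-size schedule, and HMM state count are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of a heterogeneous-pooling CNN acoustic model.
# All layer sizes, the pooling-size list, and the class name are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousPoolingCNN(nn.Module):
    def __init__(self, num_bands=40, num_frames=11, num_maps_per_group=20,
                 pool_sizes=(1, 2, 3, 4), hidden_dim=2048, num_states=183,
                 dropout_p=0.5):
        super().__init__()
        # Convolve along the frequency axis of the log filter-bank spectrogram;
        # the context frames of the input window act as input channels.
        self.pool_sizes = pool_sizes
        self.num_maps_per_group = num_maps_per_group
        total_maps = num_maps_per_group * len(pool_sizes)
        self.conv = nn.Conv1d(num_frames, total_maps, kernel_size=8)
        conv_out_bands = num_bands - 8 + 1  # length after "valid" convolution

        # Each group of feature maps gets its own (heterogeneous) pooling size,
        # trading frequency-shift invariance against speech-class confusion.
        pooled_len = sum(conv_out_bands // p for p in pool_sizes)
        self.dropout = nn.Dropout(dropout_p)
        self.fc1 = nn.Linear(pooled_len * num_maps_per_group, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        # The output layer scores the HMM states used by the decoder.
        self.out = nn.Linear(hidden_dim, num_states)

    def forward(self, x):
        # x: (batch, num_frames, num_bands) -- a window of filter-bank frames.
        h = torch.sigmoid(self.conv(x))           # (batch, total_maps, conv_out_bands)
        groups = torch.split(h, self.num_maps_per_group, dim=1)
        pooled = [F.max_pool1d(g, kernel_size=p)  # different pool size per group
                  for g, p in zip(groups, self.pool_sizes)]
        h = torch.cat([p.flatten(1) for p in pooled], dim=1)
        h = self.dropout(torch.sigmoid(self.fc1(h)))
        h = self.dropout(torch.sigmoid(self.fc2(h)))
        return self.out(h)  # softmax/normalization is applied in the loss or decoder


# Example: a batch of 11-frame windows of 40-band log filter-bank features.
model = HeterogeneousPoolingCNN()
scores = model(torch.randn(4, 11, 40))  # -> (4, 183) state scores for the HMM
```

In this reading, a pooling size of 1 preserves full frequency resolution (no invariance) while larger sizes absorb larger formant shifts; mixing the sizes across map groups is what makes the pooling heterogeneous rather than a single fixed pool.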
