Auto-encoder bottleneck features using deep belief networks

Neural network (NN) bottleneck (BN) features are typically created by training a NN with a middle bottleneck layer. Recently, an alternative structure was proposed which trains a NN with a constant number of hidden units to predict output targets, and then reduces the dimensionality of these output probabilities through an auto-encoder, to create auto-encoder bottleneck (AE-BN) features. The benefit of placing the BN after the posterior estimation network is that it avoids the loss in frame classification accuracy incurred by networks that place the BN before the softmax. In this work, we investigate the use of pre-training when creating AE-BN features. Our experiments indicate that with the AE-BN architecture, pre-trained and deeper NNs produce better AE-BN features. On a 50-hour English Broadcast News task, the AE-BN features provide over a 1% absolute improvement compared to a state-of-the-art GMM/HMM system with a WER of 18.8% and a pre-trained NN hybrid system with a WER of 18.4%. In addition, on a larger 430-hour Broadcast News task, AE-BN features provide a 0.5% absolute improvement over a strong GMM/HMM baseline with a WER of 16.0%. Finally, system combination of the GMM/HMM baseline and AE-BN systems provides an additional 0.5% absolute improvement on 430 hours over the AE-BN system alone, yielding a final WER of 15.0%.
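To make the AE-BN pipeline concrete, the following is a minimal sketch in PyTorch (an assumption; the abstract does not specify a toolkit). The class names, layer sizes, and target counts are hypothetical placeholders, not the configuration used in the paper: a posterior-estimation NN with a constant number of hidden units and no internal bottleneck produces output-target probabilities, and a separate auto-encoder compresses those posteriors so its bottleneck activations can serve as AE-BN features.

```python
# Hypothetical sketch of the AE-BN idea; dimensions and names are illustrative only.
import torch
import torch.nn as nn

class PosteriorNet(nn.Module):
    """NN that predicts output targets (e.g. HMM states); no bottleneck inside."""
    def __init__(self, feat_dim=40 * 11, hidden_dim=1024, num_targets=2000, num_layers=5):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.Sigmoid()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, num_targets))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Softmax posteriors over the output targets.
        return torch.softmax(self.net(x), dim=-1)

class AutoEncoderBN(nn.Module):
    """Auto-encoder placed after the posterior network; the low-dimensional
    bottleneck activations are taken as the AE-BN features."""
    def __init__(self, num_targets=2000, bn_dim=40):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_targets, 512), nn.Sigmoid(),
                                     nn.Linear(512, bn_dim))
        self.decoder = nn.Sequential(nn.Sigmoid(), nn.Linear(bn_dim, 512),
                                     nn.Sigmoid(), nn.Linear(512, num_targets))

    def forward(self, posteriors):
        bn = self.encoder(posteriors)   # AE-BN features
        recon = self.decoder(bn)        # reconstruction, used only for training
        return bn, recon

# Usage sketch: spliced acoustic frames -> posteriors -> AE bottleneck features.
nnet, ae = PosteriorNet(), AutoEncoderBN()
frames = torch.randn(8, 40 * 11)                 # a batch of spliced frames
post = nnet(frames)
ae_bn_features, recon = ae(post)
loss = nn.functional.mse_loss(recon, post)       # auto-encoder reconstruction loss
```

In this reading of the abstract, only the auto-encoder's encoder output is kept at test time; the AE-BN features would then replace or augment the acoustic features of a conventional GMM/HMM system.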
