Comparison and Combination of Multilayer Perceptrons and Deep Belief Networks in Hybrid Automatic Speech Recognition Systems

To improve speech recognition performance, many ways of augmenting or combining HMMs (Hidden Markov Models) with other models to build hybrid architectures have been proposed. The hybrid HMM/ANN (Hidden Markov Model / Artificial Neural Network) architecture is one of the most successful approaches. In this hybrid model, ANNs (often multilayer perceptrons, MLPs) are used as estimators of HMM-state posterior probabilities. Recently, Deep Belief Networks (DBNs) were introduced as a powerful new machine learning technique. Generally, DBNs are MLPs with many hidden layers; however, while the weights of MLPs are usually initialized randomly, DBNs use a greedy layer-by-layer pre-training algorithm to initialize the network weights. This pre-training initialization step has led to successful applications of DBNs to tasks such as handwriting recognition, 3-D object recognition, dimensionality reduction, and automatic speech recognition (ASR). To evaluate the effectiveness of the pre-training step that distinguishes DBNs from MLPs on ASR tasks, we conduct a comparative evaluation of the two systems on phone recognition with the TIMIT database. The effectiveness, advantages, and computational cost of each method are investigated and analyzed. We also show that the information generated by DBNs and MLPs is complementary: a consistent improvement is observed when the two systems are combined. In addition, we investigate the behavior of the hybrid HMM/DBN system when only a limited amount of labeled training data is available.
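
The greedy layer-by-layer pre-training the abstract contrasts with random MLP initialization can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' implementation: the function names (`train_rbm`, `pretrain_stack`), the single-step contrastive divergence (CD-1) loop, and the toy data are all invented for exposition. It stacks binary-unit Restricted Boltzmann Machines, training each on the hidden activations of the layer below; in practice, real-valued acoustic features typically call for a Gaussian-Bernoulli visible layer, and the resulting weights seed an MLP that is then fine-tuned with backpropagation (in the hybrid decoder, the network's state posteriors are typically divided by state priors to obtain scaled likelihoods before HMM decoding).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.01, rng=None):
    """Train one binary-unit RBM with single-step contrastive divergence (CD-1)."""
    rng = rng or np.random.default_rng(0)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)   # visible biases
    b_h = np.zeros(n_hidden)    # hidden biases
    for _ in range(epochs):
        # Positive phase: hidden activations driven by the data.
        h_prob = sigmoid(data @ W + b_h)
        h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step back to a reconstruction.
        v_recon = sigmoid(h_samp @ W.T + b_v)
        h_recon = sigmoid(v_recon @ W + b_h)
        # CD-1 gradient approximation (data statistics minus reconstruction statistics).
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b_v += lr * (data - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_h

def pretrain_stack(data, layer_sizes):
    """Greedy layer-by-layer pre-training: each RBM is trained on the hidden
    activations of the one below; its weights initialize the corresponding
    MLP layer before supervised fine-tuning."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(x, n_hidden)
        weights.append((W, b_h))
        x = sigmoid(x @ W + b_h)  # deterministic up-pass feeds the next layer
    return weights

# Toy usage: 256 frames of 39-dim "acoustic features", three hidden layers.
feats = np.random.default_rng(1).random((256, 39))
init_weights = pretrain_stack(feats, [128, 128, 128])
```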
