Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription

We investigate the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective. Recently, we showed that, for speaker-independent transcription of phone calls (NIST RT03S Fisher data), CD-DNN-HMMs reduced the word error rate by as much as one third, from 27.4% with discriminatively trained Gaussian-mixture HMMs on HLDA features to 18.5%, using 300+ hours of training data (Switchboard), 9000+ tied triphone states, and up to 9 hidden network layers.
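As a quick check on the "one third" figure, using only the error rates quoted above, the relative word-error-rate reduction is

\[
\frac{27.4\% - 18.5\%}{27.4\%} \approx 0.32,
\]

i.e., roughly a one-third relative improvement over the discriminatively trained GMM-HMM baseline.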
