Who can understand your speech better – deep neural network or Gaussian mixture model?

Recently we have shown that the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) can perform surprisingly well on large vocabulary speech recognition (LVSR), as demonstrated on several benchmark tasks. Since then, much work has been done to understand its potential and to further advance the state of the art. In this talk I will share some of these thoughts and introduce some of the recent progress we have made. I will first briefly describe the CD-DNN-HMM and offer some insights into why DNNs can outperform shallow neural networks and Gaussian mixture models (GMMs). My discussion will build on the observation that a DNN can be viewed as the joint model of a complicated feature extractor and a log-linear model. I will then describe how recent advances remove some of the obstacles to adopting CD-DNN-HMMs, such as training speed, decoding speed, sequence-level training, and adaptation. After that, I will show ways to further improve DNN structures to achieve better recognition accuracy and to support new scenarios. I will conclude by arguing that DNNs are not only more accurate but also simpler than GMMs.

Bio: Dr. Dong Yu joined Microsoft Corporation in 1998 and the Microsoft Speech Research Group in 2002, where he is currently a senior researcher. He holds a PhD in computer science from the University of Idaho, an MS in computer science from Indiana University Bloomington, an MS in electrical engineering from the Chinese Academy of Sciences, and a BS (with honors) in electrical engineering from Zhejiang University. His recent work focuses on deep neural networks and their applications to large vocabulary speech recognition. Dr. Yu has published over 100 papers on speech processing and machine learning and is the inventor or co-inventor of around 50 granted or pending patents.
He currently serves as an associate editor of the IEEE Transactions on Audio, Speech, and Language Processing (2011-) and has served as an associate editor of the IEEE Signal Processing Magazine (2008-2011) and as the lead guest editor of the IEEE Transactions on Audio, Speech, and Language Processing special issue on deep learning for speech and language processing (2010-2011).
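The decomposition mentioned in the abstract — a DNN viewed as a learned feature extractor whose output feeds a log-linear model — can be illustrated with a minimal forward-pass sketch. This is a toy illustration, not the talk's actual system: the layer sizes, the ReLU nonlinearity, and the 10-class output are made-up choices (CD-DNN-HMMs typically use sigmoid hidden units and thousands of senone outputs), and all names below are hypothetical.

```python
import numpy as np

def feature_extractor(x, hidden_layers):
    """The hidden layers: a learned, nonlinear feature extractor."""
    h = x
    for W, b in hidden_layers:
        h = np.maximum(0.0, h @ W + b)  # ReLU nonlinearity (illustrative choice)
    return h

def log_linear_model(h, W_out, b_out):
    """The softmax output layer: a log-linear model over the extracted features."""
    z = h @ W_out + b_out
    z -= z.max()                        # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical dimensions: a 39-dim acoustic frame, two 64-unit hidden
# layers, and 10 output classes (real systems use thousands of senones).
rng = np.random.default_rng(0)
x = rng.standard_normal(39)
hidden = [(0.1 * rng.standard_normal((39, 64)), np.zeros(64)),
          (0.1 * rng.standard_normal((64, 64)), np.zeros(64))]
W_out, b_out = 0.1 * rng.standard_normal((64, 10)), np.zeros(10)

features = feature_extractor(x, hidden)       # deep, jointly trained features
posteriors = log_linear_model(features, W_out, b_out)  # valid class posteriors
```

The point of the decomposition is that both parts are trained jointly: the feature extractor is optimized for the log-linear classifier sitting on top of it, which is one intuition for why deep networks can beat a GMM operating on fixed, hand-engineered features.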
