Paper information - Acoustic modeling with neural graph embeddings

Acoustic modeling with neural graph embeddings

Graph-based learning (GBL) is a form of semi-supervised learning that has been successfully exploited in acoustic modeling in the past. It utilizes manifold information in speech data that is represented as a joint similarity graph over training and test samples. Typically, GBL is used at the output level of an acoustic classifier; however, this setup is difficult to scale to large data sets, and the graph-based learner is not optimized jointly with other components of the speech recognition system. In this paper we explore a different approach where the similarity graph is first embedded into continuous space using a neural autoencoder. Features derived from this encoding are then used at the input level to a standard DNN-based speech recognizer. We demonstrate improved scalability and performance compared to the standard GBL approach as well as significant improvements in word error rate on a medium-vocabulary Switchboard task.
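The pipeline outlined in the abstract (build a similarity graph, embed it with an autoencoder, feed the embedding to the recognizer as input features) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `knn_similarity_graph` and `train_autoencoder`, the RBF kernel, the single tanh hidden layer, the hyperparameters, the random toy data, and the choice to concatenate the embedding with the original features.

```python
import numpy as np

def knn_similarity_graph(X, k=3, sigma=1.0):
    """Row i holds RBF similarities from sample i to its k nearest neighbours."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)                  # exclude self-matches
    W = np.zeros_like(sq)
    nn = np.argsort(sq, axis=1)[:, :k]            # indices of the k closest samples
    rows = np.arange(X.shape[0])[:, None]
    W[rows, nn] = np.exp(-sq[rows, nn] / (2 * sigma ** 2))
    return W

def train_autoencoder(W, dim=2, lr=0.5, epochs=200, seed=0):
    """One-hidden-layer autoencoder fitted by gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    n = W.shape[1]
    W1 = rng.normal(0, 0.1, (n, dim)); b1 = np.zeros(dim)
    W2 = rng.normal(0, 0.1, (dim, n)); b2 = np.zeros(n)
    losses = []
    for _ in range(epochs):
        H = np.tanh(W @ W1 + b1)                  # encoder: graph row -> code
        R = H @ W2 + b2                           # decoder: code -> reconstructed row
        err = R - W
        losses.append(float(np.mean(err ** 2)))
        gR = 2 * err / err.size                   # gradient of the mean squared error
        gW2 = H.T @ gR; gb2 = gR.sum(axis=0)
        gH = (gR @ W2.T) * (1 - H ** 2)           # backpropagate through tanh
        gW1 = W.T @ gH; gb1 = gH.sum(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    encode = lambda A: np.tanh(A @ W1 + b1)
    return encode, losses

# Toy stand-in for acoustic frames: 20 samples with 5 features each (random data).
X = np.random.default_rng(1).normal(size=(20, 5))
W = knn_similarity_graph(X, k=3)
encode, losses = train_autoencoder(W, dim=2)
# Assumed feature combination: original features plus the graph embedding.
features = np.hstack([X, encode(W)])
```

In this sketch the graph is built over all samples at once, mirroring the joint graph over training and test data mentioned in the abstract; `features` would then be passed to whatever DNN acoustic model is in use.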

Yuzong Liu | Katrin Kirchhoff

[1] Katrin Kirchhoff, et al. Graph-based learning for phonetic classification, 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[2] Larry P. Heck, et al. Deep learning of knowledge graph embeddings for semantic parsing of Twitter dialogs, 2014, 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[3] George Saon, et al. The IBM 2015 English conversational telephone speech recognition system, 2015, INTERSPEECH.

[4] Jeff A. Bilmes, et al. Unsupervised submodular subset selection for speech data, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Jeff A. Bilmes, et al. Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification, 2009, NIPS.

[6] M. Orbach, et al. Transductive phoneme classification using local scaling and confidence, 2012, 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel.

[7] Jeff A. Bilmes, et al. Submodular subset selection for large-scale speech training data, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Yoshua Bengio, et al. Greedy Layer-Wise Training of Deep Networks, 2006, NIPS.

[9] Jeff A. Bilmes, et al. Semi-Supervised Learning with Measure Propagation, 2011, J. Mach. Learn. Res.

[10] Tara N. Sainath, et al. Deep convolutional neural networks for LVCSR, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11] Razvan Pascanu, et al. Theano: new features and speed improvements, 2012, ArXiv.

[12] Joshua B. Tenenbaum, et al. Global Versus Local Methods in Nonlinear Dimensionality Reduction, 2002, NIPS.

[13] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[14] Dong Yu, et al. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks, 2012, ICML.

[15] Yuzong Liu, et al. Graph-based semi-supervised learning for phone and segment classification, 2013, INTERSPEECH.

[16] Kai Li, et al. Efficient k-nearest neighbor graph construction for generic similarity measures, 2011, WWW.

[17] Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, 2012, IEEE Signal Processing Magazine.

[18] Enhong Chen, et al. Learning Deep Representations for Graph Clustering, 2014, AAAI.

[19] Andrew W. Senior, et al. Long short-term memory recurrent neural network architectures for large scale acoustic modeling, 2014, INTERSPEECH.

[20] Xiaohui Zhang, et al. Improving deep neural network acoustic models using generalized maxout networks, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21] Yousef Saad, et al. Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection, 2009, J. Mach. Learn. Res.

[22] Xiaojin Zhu, et al. Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization, 2006.

[23] Dong Yu, et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[24] Jeff A. Bilmes, et al. Submodular feature selection for high-dimensional acoustic score spaces, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[25] Daniel Povey, et al. The Kaldi Speech Recognition Toolkit, 2011.

[26] Rishabh K. Iyer, et al. SVitchboard II and fiSVer i: high-quality limited-complexity corpora of conversational English speech, 2015, INTERSPEECH.

[27] Yuzong Liu, et al. Graph-based semi-supervised acoustic modeling in DNN-based speech recognition, 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).