Improving semi-supervised deep neural network for keyword search in low resource languages

In this work, we investigate how to improve semi-supervised deep neural network (DNN) training for low-resource languages, where the initial systems may have high error rates. We propose using semi-supervised MLP features for DNN training, and we also explore using confidence scores to improve semi-supervised cross-entropy and sequence training. The work was evaluated on keyword spotting tasks under the IARPA Babel program. We focus on the limited condition, in which around 10 hours of supervised data are available for training.
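
One common way to use confidence in semi-supervised cross-entropy training is to weight each frame's loss by the confidence of its automatic label. The sketch below illustrates that idea only; the abstract does not say how confidence is applied, so the weighting scheme, the function name, and the `min_confidence` threshold are all assumptions rather than the authors' method.

```python
import math

def confidence_weighted_cross_entropy(posteriors, labels, confidences,
                                      min_confidence=0.0):
    """Confidence-weighted mean cross entropy over frames (illustrative only).

    posteriors:     per-frame lists of class probabilities from the network
    labels:         per-frame class indices (automatic labels for untranscribed data)
    confidences:    per-frame weights in [0, 1]; 1.0 for manually transcribed frames
    min_confidence: hypothetical cutoff; frames below it are ignored
    """
    weighted_loss = 0.0
    total_weight = 0.0
    for frame_posteriors, label, confidence in zip(posteriors, labels, confidences):
        weight = confidence if confidence >= min_confidence else 0.0
        if weight == 0.0:
            continue  # a dropped frame contributes nothing to the loss
        weighted_loss += -weight * math.log(frame_posteriors[label])
        total_weight += weight
    # Normalize by the total weight so the loss scale does not depend on
    # how many frames were down-weighted or dropped.
    return weighted_loss / total_weight if total_weight > 0.0 else 0.0
```

With all confidences set to 1.0 this reduces to the ordinary mean frame cross entropy, so supervised and automatically labeled frames can share one loss function.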
