Paper information - Improving deep neural network acoustic models using unlabeled data

The Context-Dependent Deep-Neural-Network HMM, or CD-DNN-HMM, is a powerful acoustic modeling technique. Its training process typically involves unsupervised pre-training followed by supervised fine-tuning. In this paper, we demonstrate that the performance of DNNs can be improved by utilizing a large amount of unlabeled data in the training procedure. With our method, a CD-DNN-HMM trained using 309 hours of unlabeled data and 24 hours of labeled data achieved a word-error rate of 23.7% on the Hub5'00-SWB phone-call transcription task, compared to a word-error rate of 24.3% obtained by a CD-DNN-HMM trained without unlabeled data. We also applied an a priori probability smoothing algorithm that further reduced the word-error rate to 23.2%. On the RT03S-FSH benchmark corpus, our experimental results show that similar performance gains can be obtained by using unlabeled data.
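To put the reported Hub5'00-SWB numbers in relative terms, the following minimal sketch computes the relative word-error-rate reduction from the three figures quoted in the abstract. The function name `relative_wer_reduction` is a hypothetical helper introduced only for this illustration; the abstract itself reports absolute rates only.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer versus baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Word-error rates (in percent) quoted in the abstract for Hub5'00-SWB.
BASELINE = 24.3          # CD-DNN-HMM trained without unlabeled data
WITH_UNLABELED = 23.7    # trained with 309 h unlabeled + 24 h labeled data
WITH_SMOOTHING = 23.2    # additionally using a priori probability smoothing

unlabeled_gain = round(relative_wer_reduction(BASELINE, WITH_UNLABELED), 2)
smoothing_gain = round(relative_wer_reduction(BASELINE, WITH_SMOOTHING), 2)

print(f"Unlabeled data: {unlabeled_gain}% relative reduction")
print(f"Plus smoothing: {smoothing_gain}% relative reduction")
```

Under these figures, the gain from unlabeled data alone is roughly 2.5% relative, and roughly 4.5% relative once the smoothing step is included.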

Meng Cai | Jia Liu | Wei-Qiang Zhang
