Large-Scale Semi-Supervised Training in Deep Learning Acoustic Model for ASR

This study investigated large-scale semi-supervised training (SST) for improving acoustic models in automatic speech recognition (ASR). Conventional self-training, the recently proposed committee-based SST using heterogeneous neural networks, and lattice-based SST were examined and compared. Large-scale SST was studied for deep neural network acoustic modeling with respect to automatic transcription quality, the importance of data filtering, training data quantity, and other attributes of a large corpus of multi-genre unsupervised live data. We found that SST behavior on large-scale ASR tasks differs markedly from that observed in small-scale SST: 1) big data can tolerate a certain degree of mislabeling in the automatic transcriptions used for SST, so further performance gains are achievable with additional unsupervised fresh data even when its automatic transcriptions contain some errors; 2) the audio attributes, transcription quality, and importance of the fresh data matter more for large-scale SST than the sheer increase in data quantity; and 3) performance gains vary widely across recognition tasks, so the benefits depend strongly on the selected attributes of the unsupervised data and on the data scale of the baseline ASR system. Furthermore, we proposed a novel utterance-filtering approach based on active learning to improve data selection in large-scale SST. Experimental results showed that SST with the proposed data filtering yields a 2-11% relative word error rate reduction on five multi-genre recognition tasks, even when the baseline acoustic model was already well trained on a 10,000-hour supervised dataset.
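
The abstract names two data-selection mechanisms that recur throughout this line of work: confidence-based filtering of automatic transcriptions (conventional self-training) and agreement between heterogeneous models (committee-based SST). The Python snippet below is a minimal sketch of both ideas, not the authors' implementation: the `DecodedUtterance` record, the threshold values, and the WER-based agreement criterion are illustrative assumptions.

```python
"""A minimal sketch of two SST data-selection ideas, under assumed
interfaces: (1) confidence-based filtering, as in conventional
self-training; (2) committee-based selection, keeping utterances on
which two heterogeneous systems produce (near-)identical hypotheses."""
from dataclasses import dataclass
from typing import List


@dataclass
class DecodedUtterance:
    audio_path: str
    hypothesis: List[str]          # 1-best automatic transcription (words)
    word_confidences: List[float]  # per-word posterior confidences in [0, 1]


def average_confidence(utt: DecodedUtterance) -> float:
    """Mean per-word confidence; empty hypotheses score 0.0."""
    if not utt.word_confidences:
        return 0.0
    return sum(utt.word_confidences) / len(utt.word_confidences)


def confidence_filter(pool: List[DecodedUtterance],
                      threshold: float = 0.9) -> List[DecodedUtterance]:
    """Keep utterances whose average confidence clears the threshold."""
    return [u for u in pool if average_confidence(u) >= threshold]


def word_error_rate(ref: List[str], hyp: List[str]) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances from empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(ref)


def committee_filter(decoded_a: List[DecodedUtterance],
                     decoded_b: List[DecodedUtterance],
                     max_disagreement: float = 0.1) -> List[DecodedUtterance]:
    """Keep utterances on which two heterogeneous systems nearly agree,
    taking system A's hypothesis as the training transcription."""
    kept = []
    for a, b in zip(decoded_a, decoded_b):
        if word_error_rate(a.hypothesis, b.hypothesis) <= max_disagreement:
            kept.append(a)
    return kept
```

In a self-training loop, `confidence_filter` (or the stricter `committee_filter`) would select the utterances whose automatic transcriptions serve as labels when retraining the acoustic model; tightening the thresholds trades data quantity for transcription quality, the trade-off the abstract argues dominates at large scale.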
