Paper information - Semi-supervised Training for End-to-end Models via Weak Distillation

Semi-supervised Training for End-to-end Models via Weak Distillation

End-to-end (E2E) models are a promising research direction in speech recognition, as a single all-neural E2E system offers a much simpler and more compact solution than a conventional model, which has separate acoustic (AM), pronunciation (PM), and language (LM) models. However, it has been noted that E2E models perform poorly on tail words and proper nouns, likely because end-to-end optimization requires joint audio-text pairs and does not take advantage of the additional lexicons and large amounts of text-only data used to train the LMs in conventional models. There have been numerous efforts to train an RNN-LM on text-only data and fuse it into the end-to-end model. In this work, we contrast this approach with training the E2E model on audio-text pairs generated from unsupervised speech data. To target the proper noun issue specifically, we adopt a Part-of-Speech (POS) tagger to filter the unsupervised data, keeping only utterances that contain proper nouns. We show that training with filtered unsupervised data provides up to a 13% relative reduction in word error rate (WER), and, when used in conjunction with a cold-fusion RNN-LM, up to a 17% relative improvement.
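The proper-noun filtering step and the "relative reduction" metric mentioned in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation: the function names, the `PROPN` tag label, and the capitalization-based `toy_tagger` are all assumptions made so the example runs without external models, and a real pipeline would plug in a trained POS tagger. The WER values in the usage lines are arbitrary and are not results from the paper.

```python
from typing import Callable, List, Tuple

# (audio_id, transcript) pairs, e.g. machine-generated transcripts of
# untranscribed audio. All names in this sketch are hypothetical.
Utterance = Tuple[str, str]
Tagger = Callable[[List[str]], List[str]]


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stand-in tagger (an assumption, not a real POS tagger).

    Marks any capitalized token that is not sentence-initial as PROPN.
    """
    return [
        "PROPN" if i > 0 and tok[:1].isupper() else "OTHER"
        for i, tok in enumerate(tokens)
    ]


def keep_proper_noun_utterances(
    utterances: List[Utterance], tagger: Tagger = toy_tagger
) -> List[Utterance]:
    """Keep only utterances whose transcript has at least one PROPN tag."""
    kept = []
    for audio_id, text in utterances:
        tokens = text.split()
        if "PROPN" in tagger(tokens):
            kept.append((audio_id, text))
    return kept


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction: (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Usage with made-up data (illustrative only).
utterances = [
    ("a", "play songs by Taylor Swift"),
    ("b", "what time is it"),
    ("c", "Navigate to Boston"),
]
filtered = keep_proper_noun_utterances(utterances)
reduction = relative_wer_reduction(20.0, 17.0)  # arbitrary WER values
```

Passing the tagger in as a parameter keeps the filter independent of any particular tagging library, so the heuristic can be swapped for a trained model without changing the filtering logic.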
