Semi-supervised Training for End-to-end Models via Weak Distillation

End-to-end (E2E) models are a promising research direction in speech recognition, as a single all-neural E2E system offers a much simpler and more compact solution than a conventional model, which has separate acoustic (AM), pronunciation (PM), and language (LM) models. However, it has been noted that E2E models perform poorly on tail words and proper nouns, likely because end-to-end optimization requires paired audio-text data and does not take advantage of the additional lexicons and large amounts of text-only data used to train the LMs in conventional models. There have been numerous efforts to train an RNN-LM on text-only data and fuse it into the end-to-end model. In this work, we contrast this approach with training the E2E model on audio-text pairs generated from unsupervised speech data. To target the proper noun issue specifically, we use a Part-of-Speech (POS) tagger to filter the unsupervised data, keeping only utterances that contain proper nouns. We show that training with the filtered unsupervised data provides up to a 13% relative reduction in word error rate (WER), and, when used in conjunction with a cold-fusion RNN-LM, up to a 17% relative improvement.
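
The proper-noun filtering step and the meaning of a "relative" WER reduction can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the Penn Treebank tag names (`NNP`, `NNPS`), the function names, the `toy_tagger` stand-in, and the WER values in the usage example are all assumptions made for illustration. A real system would plug in a trained POS tagger and machine-generated transcripts of the unsupervised audio.

```python
from typing import Callable, Iterable, List, Tuple

# Assumption: the tagger emits Penn Treebank tags, where NNP/NNPS mark proper nouns.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}

TaggedTokens = List[Tuple[str, str]]   # [(word, tag), ...]
Pair = Tuple[str, str]                 # (audio_id, hypothesized transcript)


def has_proper_noun(tagged: TaggedTokens) -> bool:
    """Return True if any token carries a proper-noun tag."""
    return any(tag in PROPER_NOUN_TAGS for _, tag in tagged)


def filter_pairs(pairs: Iterable[Pair],
                 tagger: Callable[[str], TaggedTokens]) -> List[Pair]:
    """Keep only the audio-text pairs whose transcript contains a proper noun."""
    return [(audio_id, text) for audio_id, text in pairs
            if has_proper_noun(tagger(text))]


def toy_tagger(text: str) -> TaggedTokens:
    """Hypothetical stand-in for a real POS tagger: capitalized tokens become NNP."""
    return [(w, "NNP" if w[:1].isupper() else "X") for w in text.split()]


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction: the fraction of the baseline WER that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    pairs = [("utt1", "play songs by Adele"), ("utt2", "set a timer")]
    print(filter_pairs(pairs, toy_tagger))
    # Invented WER values, chosen only to show the arithmetic.
    print(round(relative_wer_reduction(10.0, 8.7), 2))
```

With these invented numbers, a drop from 10.0% to 8.7% WER is a 1.3-point absolute reduction but a 13% relative one, which is the sense in which the figures above are reported.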
