Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition

Teacher-student (T/S) learning has been shown to be effective for domain adaptation of deep neural network acoustic models in hybrid speech recognition systems. In this work, we extend T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: the teacher's token posteriors as soft labels and its one-best predictions as decoder guidance. To further improve T/S learning when ground-truth labels are available, we propose adaptive T/S (AT/S) learning. Instead of conditionally choosing either the teacher's soft token posteriors or the one-hot ground-truth label, in AT/S the student always learns from both the teacher and the ground truth, with a pair of adaptive weights assigned to the soft and one-hot labels that quantify the confidence in each knowledge source. The confidence scores are dynamically estimated at each decoder step as a function of the soft and one-hot labels. With 3400 hours of parallel close-talk and far-field Microsoft Cortana data for domain adaptation, T/S and AT/S learning achieve 6.3% and 10.3% relative word error rate improvements, respectively, over a strong E2E model trained on the same amount of far-field data.
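The per-step AT/S objective described above can be sketched as follows. This is a minimal illustrative example, not the paper's exact formulation: the function `ats_loss` and the specific confidence scheme (using the teacher's posterior on the ground-truth token as the soft-label weight) are assumptions chosen only to show how a decoder-step loss could blend both knowledge sources with dynamically estimated weights.

```python
import numpy as np

def ats_loss(teacher_probs, student_log_probs, target_idx):
    """Illustrative adaptive teacher-student loss for one decoder step.

    teacher_probs:     teacher posterior over the token vocabulary
    student_log_probs: student log-posterior over the same vocabulary
    target_idx:        index of the ground-truth token

    The weighting scheme here is hypothetical: the soft-label weight is
    the teacher's own probability on the ground-truth token, so a
    confident, correct teacher dominates the loss; otherwise the one-hot
    ground-truth term takes over.
    """
    conf_soft = teacher_probs[target_idx]      # confidence in the teacher
    conf_hot = 1.0 - conf_soft                 # confidence in the one-hot label

    # Soft-label term: cross-entropy between teacher and student posteriors.
    soft_ce = -np.sum(teacher_probs * student_log_probs)
    # One-hot term: standard cross-entropy with the ground-truth token.
    hot_ce = -student_log_probs[target_idx]

    return conf_soft * soft_ce + conf_hot * hot_ce
```

In a full E2E model this quantity would be summed over decoder steps and averaged over the minibatch; the key point is that `conf_soft` and `conf_hot` are recomputed at every step from the labels themselves rather than fixed globally.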
