Speaker Adaptation for Attention-Based End-to-End Speech Recognition

We propose three regularization-based speaker adaptation approaches for adapting an attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers in end-to-end automatic speech recognition. The first method is Kullback-Leibler divergence (KLD) regularization, in which the output distribution of the speaker-dependent (SD) AED is forced to stay close to that of the speaker-independent (SI) model by adding a KLD regularization term to the adaptation criterion. To compensate for the asymmetry of the KLD regularization, the second method, adversarial speaker adaptation (ASA), regularizes the deep-feature distribution of the SD AED through adversarial learning between an auxiliary discriminator and the SD AED. The third approach is multi-task learning (MTL), in which the SD AED is trained to jointly perform the primary task of predicting a large number of output units and an auxiliary task of predicting a small number of output units, alleviating the target sparsity issue. Evaluated on a Microsoft short message dictation task, all three methods are highly effective in adapting the AED model, achieving up to 12.2% and 3.0% word error rate improvements over an SI AED trained on 3400 hours of data for supervised and unsupervised adaptation, respectively.
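As a concrete illustration of the first method, the sketch below implements a KLD-regularized adaptation loss in PyTorch. It relies on the standard reformulation in which minimizing cross entropy against an interpolation of the one-hot labels and the frozen SI model's output distribution is equivalent, up to an additive constant, to the weighted sum of the adaptation cross entropy and a KLD term between the SI and SD output distributions. The function name, tensor shapes, and the weight `rho` are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def kld_adaptation_loss(sd_logits, si_logits, targets, rho=0.5):
    """KLD-regularized adaptation loss (a minimal sketch).

    sd_logits: (N, V) outputs of the speaker-dependent AED being adapted.
    si_logits: (N, V) outputs of the frozen speaker-independent AED.
    targets:   (N,)   ground-truth label indices.
    rho:       interpolation weight of the KLD regularizer (hypothetical).
    """
    # The SI model is frozen; its posteriors only serve as soft targets.
    si_probs = F.softmax(si_logits.detach(), dim=-1)
    one_hot = F.one_hot(targets, num_classes=si_probs.size(-1)).float()
    # Interpolated target distribution: (1 - rho) * one-hot + rho * p_SI.
    mixed = (1.0 - rho) * one_hot + rho * si_probs
    log_sd = F.log_softmax(sd_logits, dim=-1)
    # Cross entropy against the interpolated targets.
    return -(mixed * log_sd).sum(dim=-1).mean()
```

With `rho = 0`, this reduces to ordinary fine-tuning on the adaptation data; larger values of `rho` keep the SD model's outputs closer to the SI model's and guard against overfitting the very limited adaptation set.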

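The ASA method can be realized in several ways; below is a minimal gradient-reversal-style sketch, assuming per-frame deep features `f_sd` from the SD encoder and `f_si` from the frozen SI encoder. The layer sizes, the weight `lam`, and all names are hypothetical, and the paper's exact optimization scheme may differ.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class FeatureDiscriminator(nn.Module):
    """Auxiliary binary classifier over deep features: SD (0) vs. SI (1)."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats):
        return self.net(feats)

def asa_discriminator_loss(disc, f_sd, f_si, lam=0.1):
    # The discriminator is trained to separate SD from SI deep features,
    # while the reversed gradient on the SD path pushes the SD encoder to
    # make its deep-feature distribution indistinguishable from the SI one.
    bce = nn.functional.binary_cross_entropy_with_logits
    sd_logits = disc(GradReverse.apply(f_sd, lam))
    si_logits = disc(f_si.detach())  # the SI model stays frozen
    return bce(sd_logits, torch.zeros_like(sd_logits)) + \
           bce(si_logits, torch.ones_like(si_logits))
```

This adversarial loss would be added to the primary adaptation cross entropy; because the regularization acts on feature distributions rather than per-example posteriors, it does not inherit the asymmetry of the KLD criterion.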
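Finally, a sketch of the MTL approach: two output layers share the AED decoder, with the auxiliary small-inventory head supplying many more targets per utterance than the sparse large-inventory head. The unit inventories, dimensions, and the weight `alpha` are hypothetical; in practice each task may be decoded against its own label sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    """Primary (large-inventory) and auxiliary (small-inventory) output
    layers on a shared AED decoder state (a sketch; dimensions hypothetical).
    """
    def __init__(self, dec_dim=512, n_large=34000, n_small=30):
        super().__init__()
        self.primary = nn.Linear(dec_dim, n_large)    # e.g. word-level units
        self.auxiliary = nn.Linear(dec_dim, n_small)  # e.g. characters

    def forward(self, dec_state):
        # A single shared decoder state is used here for brevity.
        return self.primary(dec_state), self.auxiliary(dec_state)

def mtl_adaptation_loss(large_logits, large_tgt, small_logits, small_tgt,
                        alpha=0.3):
    # Weighted sum of primary and auxiliary cross entropies; each head is
    # scored against its own label sequence, so the two lengths may differ.
    return (1.0 - alpha) * F.cross_entropy(large_logits, large_tgt) + \
        alpha * F.cross_entropy(small_logits, small_tgt)
```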