Teach an All-rounder with Experts in Different Domains

In many automatic speech recognition (ASR) tasks, an ideal model must perform well across multiple domains. In this paper, we propose to teach an all-rounder with experts in different domains. Concretely, we build a multi-domain acoustic model using the teacher-student training framework. First, for each domain, a teacher model (domain-dependent model) is trained by fine-tuning a multi-condition model on a domain-specific subset. Then all of these teacher models are used to teach a single student model simultaneously. We perform experiments on two predefined domain setups: one covers different speaking styles; the other covers near-field, far-field, and noisy far-field conditions. Moreover, two model types are examined: the deep feedforward sequential memory network (DFSMN) and long short-term memory (LSTM). Experimental results show that a model trained with this framework outperforms not only the multi-condition model but also the domain-dependent models. In particular, our training method provides up to a 10.4% relative character error rate reduction over the baseline (multi-condition) model.
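To make the training objective concrete, below is a minimal PyTorch sketch of the multi-teacher distillation step, assuming one frozen per-domain teacher and a frame-level senone classifier as the student. The temperature, interpolation weight, and the teachers/student model objects are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

# Assumed setup: `teachers` maps a domain id to a frozen acoustic model
# (fine-tuned from the multi-condition model on that domain's subset);
# `student` is any frame-level classifier (e.g. a DFSMN or LSTM).

def distillation_loss(student_logits, teacher_logits, hard_targets,
                      temperature=2.0, soft_weight=0.5):
    """Interpolate cross-entropy on hard labels with KL divergence
    against the teacher's temperature-softened posteriors."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so the soft-target gradients keep a comparable
    # magnitude as the temperature changes.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, hard_targets)
    return soft_weight * soft_loss + (1.0 - soft_weight) * hard_loss

def train_step(student, teachers, optimizer, batch):
    """One update: each batch is scored by the teacher matching its
    domain, so all teachers jointly teach one student."""
    features, hard_targets, domain_id = batch
    with torch.no_grad():
        teacher_logits = teachers[domain_id](features)
    student_logits = student(features)
    loss = distillation_loss(student_logits, teacher_logits, hard_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Routing each batch to the teacher of its own domain is what allows all teachers to shape one student simultaneously; at test time, only the single student model is kept, so no domain label is required during inference under this sketch.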
