Speaker Adaptive Training and Mixup Regularization for Neural Network Acoustic Models in Automatic Speech Recognition

This work investigates speaker adaptation and regularization techniques for deep neural network acoustic models (AMs) in automatic speech recognition (ASR) systems. Previous work has shown GMM-derived (GMMD) features to be an efficient technique for adapting neural network AMs. In this paper, we propose and investigate a novel way to improve speaker adaptive training (SAT) for neural network AMs using GMMD features. The idea is to use inaccurate transcriptions produced by ASR for adaptation during neural network training, while keeping the exact transcriptions for the neural network targets. In addition, we apply the mixup technique, recently proposed for classification tasks, to acoustic models for ASR and investigate its impact on speaker-adapted acoustic models. Experimental results on the TED-LIUM corpus show that the proposed approaches provide an additional gain in speech recognition performance over speaker-adapted AMs.
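As a rough illustration of the mixup idea mentioned above, the sketch below follows the general formulation used for classification: two training examples and their one-hot targets are replaced by a convex combination, with the mixing weight drawn from a Beta(alpha, alpha) distribution. This is a minimal sketch and not the configuration used in the paper. The function names `mixup_pair` and `mixup_batch`, the value `alpha=0.2`, and the use of one weight per batch are all assumptions made for illustration.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Convexly combine two (feature vector, one-hot target) pairs.

    lam is the weight given to the first example; (1 - lam) goes to the second.
    """
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


def mixup_batch(features, targets, alpha=0.2, rng=None):
    """Mix every example in a batch with a randomly chosen partner.

    alpha=0.2 is an illustrative default (assumption), not a value from the paper.
    A single weight is drawn per batch here; drawing one per example is an
    equally plausible variant.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    partners = list(range(len(features)))
    rng.shuffle(partners)  # random partner index for each example
    mixed = [
        mixup_pair(features[i], targets[i], features[j], targets[j], lam)
        for i, j in enumerate(partners)
    ]
    mixed_x = [x for x, _ in mixed]
    mixed_y = [y for _, y in mixed]
    return mixed_x, mixed_y, lam
```

In an acoustic modeling setting, `features` would hold per-frame feature vectors and `targets` the corresponding one-hot frame-level class targets; the mixed targets are then soft labels that still sum to one, so they can be used directly with a cross-entropy loss.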
