Unsupervised Feature Adaptation Using Adversarial Multi-Task Training for Automatic Evaluation of Children's Speech

Processing children’s speech is challenging due to high speaker variability arising from vocal tract size and scarce amounts of publicly available linguistic resources. In this work, we tackle such challenges by proposing an unsupervised feature adaptation approach based on adversarial multi-task training in a neural framework. A front-end feature transformation module is positioned prior to an acoustic model trained on adult speech (1) to leverage on the readily available linguistic resources on adult speech or existing models, and (2) to reduce the acoustic mismatch between child and adult speech. Experimental results demonstrate that our proposed approach consistently outperforms established baselines trained on adult speech across a variety of tasks ranging from speech recognition to pronunciation assessment and fluency score prediction.

[1]  Byoung-Tak Zhang,et al.  Overcoming Catastrophic Forgetting by Incremental Moment Matching , 2017, NIPS.

[2]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[3]  Lei Xie,et al.  Domain Adversarial Training for Improving Keyword Spotting Performance of ESL Speech , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  David Suendermann-Oeft,et al.  Self-Adaptive DNN for Improving Spoken Language Proficiency Assessment , 2016, INTERSPEECH.

[5]  Diego Giuliani,et al.  Deep-neural network approaches for speech recognition with heterogeneous groups of speakers including children† , 2016, Natural Language Engineering.

[6]  Yu Wang,et al.  Towards automatic assessment of spontaneous spoken English , 2018, Speech Commun..

[7]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[8]  Yusuke Shinohara,et al.  Adversarial Multi-Task Learning of Deep Neural Networks for Robust Speech Recognition , 2016, INTERSPEECH.

[9]  Jianhua Lu,et al.  Child automatic speech recognition for US English: child interaction with living-room-electronic-devices , 2014, WOCCI.

[10]  Diego Giuliani,et al.  Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[11]  Mark J. F. Gales,et al.  Mean and variance adaptation within the MLLR framework , 1996, Comput. Speech Lang..

[12]  Yong Wang,et al.  Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers , 2015, Speech Commun..

[13]  Tatsuya Kawahara,et al.  Joint Optimization of Denoising Autoencoder and DNN Acoustic Model Based on Multi-Target Learning for Noisy Speech Recognition , 2016, INTERSPEECH.

[14]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[15]  Fabio Brugnara,et al.  Acoustic variability and automatic recognition of children's speech , 2007, Speech Commun..

[16]  Jun Du,et al.  Joint training of front-end and back-end deep neural networks for robust speech recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Diego Giuliani,et al.  Non-Native Children Speech Recognition Through Transfer Learning , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Xiaoming Xi,et al.  Automatic scoring of non-native spontaneous speech in tests of spoken English , 2009, Speech Commun..

[19]  Shrikanth S. Narayanan,et al.  Improving speech recognition for children using acoustic adaptation and pronunciation modeling , 2014, WOCCI.

[20]  Shrikanth S. Narayanan,et al.  Robust recognition of children's speech , 2003, IEEE Trans. Speech Audio Process..

[21]  Mark J. F. Gales,et al.  Disfluency Detection for Spoken Learner English , 2019, SLaTE.

[22]  Tara N. Sainath,et al.  Large vocabulary automatic speech recognition for children , 2015, INTERSPEECH.

[23]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Hui Jiang,et al.  Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[25]  Martin J. Russell,et al.  Challenges for computer recognition of children2s speech , 2007, SLaTE.

[26]  Jian Cheng,et al.  Using deep neural networks to improve proficiency assessment for children English language learners , 2014, INTERSPEECH.

[27]  R. French Catastrophic forgetting in connectionist networks , 1999, Trends in Cognitive Sciences.

[28]  Kristin Precoda,et al.  The SRI EduSpeak System: Recognition and Pronunciation Scoring for Language Learning , 2007 .

[29]  Shrikanth S. Narayanan,et al.  Automatic speech recognition for children , 1997, EUROSPEECH.

[30]  Biing-Hwang Juang,et al.  Speaker-Invariant Training Via Adversarial Learning , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Klaus Zechner,et al.  Using bidirectional lstm recurrent neural networks to learn high-level abstractions of sequential features for automated scoring of non-native spontaneous speech , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[32]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[33]  Florian Metze,et al.  Speaker Adaptive Training of Deep Neural Network Acoustic Models Using I-Vectors , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[34]  Helmer Strik,et al.  Automatic evaluation of Dutch pronunciation by using speech recognition technology , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.