Multi-Task Learning for Mispronunciation Detection on Singapore Children's Mandarin Speech

Speech technology for children is more challenging than for adults because children’s speech corpora are scarce. Children’s speech is also more heterogeneous, owing to anatomical variability across age and gender, larger variance in speaking rate and vocal effort, and immature command of word usage, grammar, and linguistic structure. Speech produced by Singapore children is even more variable because of the city-state’s multilingual environment, which brings cross-influences from Chinese languages (e.g., Hokkien and Mandarin), English dialects (e.g., American and British), and Indian languages (e.g., Hindi and Tamil). In this paper, we show that acoustic modeling of children’s speech can leverage a larger set of adult data. We compare two data augmentation approaches for children’s acoustic modeling. The first approach disregards the child and adult categories and pools the two datasets into a single training set. The second approach is multi-task learning: during training, the acoustic characteristics of adults and children are learned jointly through the shared hidden layers of a deep neural network, while each group retains its own targets through a separate softmax layer. We empirically show that the multi-task learning approach outperforms the baseline in both speech recognition and computer-assisted pronunciation training.
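The shared-hidden-layer architecture with two softmax layers can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the class name `MultiTaskAcousticModel`, the layer sizes, the sigmoid activation, the random initialization, and the toy feature frame are all assumptions made for illustration, and training (backpropagation) is omitted.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Affine transform: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiTaskAcousticModel:
    """Hidden layers shared by all tasks, one softmax head per task."""

    def __init__(self, n_features, n_hidden, n_targets, seed=0):
        rng = random.Random(seed)
        # Shared layers see frames from both adults and children.
        self.shared = [init_layer(rng, n_features, n_hidden),
                       init_layer(rng, n_hidden, n_hidden)]
        # Each speaker group keeps its own output layer and target set.
        self.heads = {task: init_layer(rng, n_hidden, size)
                      for task, size in n_targets.items()}

    def forward(self, frame, task):
        hidden = frame
        for weights, bias in self.shared:
            hidden = [1.0 / (1.0 + math.exp(-v))
                      for v in linear(hidden, weights, bias)]
        weights, bias = self.heads[task]
        return softmax(linear(hidden, weights, bias))


def cross_entropy(posteriors, target):
    """Per-frame loss; in training it would update the shared layers
    and only the head of the task the frame belongs to."""
    return -math.log(posteriors[target])


# Hypothetical sizes: 4-dim features, 5 adult targets, 3 child targets.
model = MultiTaskAcousticModel(n_features=4, n_hidden=8,
                               n_targets={"adult": 5, "child": 3})
frame = [0.2, -0.1, 0.4, 0.05]
adult_posteriors = model.forward(frame, "adult")
child_posteriors = model.forward(frame, "child")
child_loss = cross_entropy(child_posteriors, target=1)
```

In this sketch a frame from a child speaker is scored only by the `"child"` head, yet it passes through the same hidden layers as adult frames, which is how the larger adult set can inform the children's model. The pooled alternative would instead correspond to a single head trained on both datasets.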
