Transfer learning for children's speech recognition

Children's speech is more challenging to process than adults' speech, in part because large-scale children's speech corpora are lacking. As children's speech organs develop, their speech shows high inter-speaker and intra-speaker variability. Data collection is also difficult, because children usually have short attention spans and limited language proficiency. In this paper, we propose to improve automatic speech recognition for children with transfer learning. We compare two transfer learning approaches that use adults' data to enhance children's speech recognition. The first performs acoustic model adaptation on a pre-trained adult model. The second trains the acoustic model with deep neural network based multi-task learning: the adults' and children's acoustic characteristics are learnt jointly in the shared hidden layers, while the output layers are optimized separately for each speaker group. Our experimental results show that both approaches are effective in transferring rich phonetic and acoustic information from the adult model to the children's model, and that the multi-task learning approach outperforms the acoustic adaptation approach. We further show that speakers' acoustic characteristics in other languages can also benefit the target language under the multi-task learning framework.
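The multi-task architecture described above (shared hidden layers, one output layer per speaker group) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `MultiTaskAcousticModel`, the layer sizes, the ReLU activation, and the number of output states per group are all assumptions chosen for illustration, and only the forward pass is shown.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one affine layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(layer, x):
    """Compute W x + b for a single input vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax over a list of scores."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


class MultiTaskAcousticModel:
    """Hypothetical sketch: hidden layers shared by all speaker groups,
    plus a separate output layer (head) for each group."""

    def __init__(self, n_features, n_hidden, n_states, seed=0):
        rng = random.Random(seed)
        # Shared hidden layers: used for every frame, whatever the group.
        self.shared = [
            make_layer(n_features, n_hidden, rng),
            make_layer(n_hidden, n_hidden, rng),
        ]
        # Group-specific output layers, keyed by speaker group name.
        self.heads = {
            group: make_layer(n_hidden, size, rng) for group, size in n_states.items()
        }

    def forward(self, frame, group):
        """Return state posteriors for one feature frame of the given group."""
        h = frame
        for layer in self.shared:
            h = relu(affine(layer, h))
        return softmax(affine(self.heads[group], h))


# Assumed toy dimensions: 4 features per frame, 8 hidden units,
# 5 output states for adults and 3 for children.
model = MultiTaskAcousticModel(
    n_features=4, n_hidden=8, n_states={"adult": 5, "child": 3}
)
frame = [0.2, -0.1, 0.4, 0.0]
adult_posteriors = model.forward(frame, "adult")
child_posteriors = model.forward(frame, "child")
```

In a training loop built on this sketch, frames from both groups would update the shared layers, while each output layer would be updated only by frames from its own group. How the paper schedules or weights the two tasks is not stated in the abstract.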
