Mismatched training data enhancement for automatic recognition of children's speech using DNN-HMM

The growth of commercial automatic speech recognition applications has been driven by big-data techniques that rely on high-quality labelled speech datasets. Children's speech has greater time- and frequency-domain variability than typical adult speech, lacks good large-scale training data, and presents difficulties relating to capture quality. Each of these factors reduces the performance of systems that automatically recognise children's speech. In this paper, children's speech recognition is investigated using a hybrid acoustic modelling approach based on deep neural networks and Gaussian mixture models with hidden Markov model back ends. We explore two ways of improving performance when training data is limited: incorporating mismatched training data to obtain a better acoustic model, and augmenting the training data with noise. We also explore two arrangements for vocal tract length normalisation and a gender-based data selection technique suitable for training a children's speech recogniser.
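
As an illustration of training data augmentation using noise, the following minimal sketch mixes a noise recording into a clean utterance at a chosen signal-to-noise ratio. It is not the procedure used in the paper: the function name `add_noise_at_snr`, the synthetic signals, and the 10 dB target are assumptions made purely for this example, and the abstract does not state which noise types or SNR levels were used.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Return speech mixed with noise scaled to give the requested SNR (dB).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat or truncate the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0.0 or noise_power == 0.0:
        # Nothing sensible to scale; return the utterance unchanged.
        return speech.copy()

    # Choose the gain so that speech_power / (gain^2 * noise_power)
    # equals 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


# Synthetic stand-ins for a clean utterance and a noise recording
# (assumed 16 kHz sampling rate, one second of audio).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)
noise = rng.standard_normal(4000)

noisy = add_noise_at_snr(clean, noise, snr_db=10.0)
```

In an augmentation setup, each clean training utterance would typically be duplicated at several SNR levels and with several noise types, so that the acoustic model sees more varied conditions than the original recordings provide.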
