Large vocabulary automatic speech recognition for children

Recently, Google launched YouTube Kids, a mobile application for children, that uses a speech recognizer built specifically for recognizing children’s speech. In this paper we present techniques we explored to build such a system. We describe the use of a neural network classifier to identify matched acoustic training data, filtering data for language modeling to reduce the chance of producing offensive results. We also compare long short-term memory (LSTM) recurrent networks to convolutional, LSTM, deep neural networks (CLDNN). We found that a CLDNN acoustic model outperforms an LSTM across a variety of different conditions, but does not specifically model child speech relatively better than adult. Overall, these findings allow us to build a successful, state-of-the-art large vocabulary speech recognizer for both children and adults.

[1]  E. A. Martin,et al.  Multi-style training for robust isolated-word speech recognition , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2]  Michael Picheny,et al.  Context Dependent Modeling of Phones in Continuous Speech Using Decision Trees , 1991, HLT.

[3]  S. J. Young,et al.  Tree-based state tying for high accuracy acoustic modelling , 1994 .

[4]  Herbert Gish,et al.  A parametric approach to vocal tract length normalization , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[5]  S. Wegmann,et al.  Speaker normalization on conversational telephone speech , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[6]  Shrikanth S. Narayanan,et al.  Automatic speech recognition for children , 1997, EUROSPEECH.

[7]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[8]  Li Lee,et al.  A frequency warping approach to speaker normalization , 1998, IEEE Trans. Speech Audio Process..

[9]  Hideki Kawahara,et al.  YIN, a fundamental frequency estimator for speech and music. , 2002, The Journal of the Acoustical Society of America.

[10]  Kevyn Collins-Thompson,et al.  A Language Modeling Approach to Predicting Reading Difficulty , 2004, NAACL.

[11]  Daniel Elenius,et al.  Adaptation and normalization experiments in speech recognition for 4 to 8 year old children , 2005, INTERSPEECH.

[12]  Li Qun,et al.  The Effects of Bandwidth Reduction on Human and Computer Recognition of Children's Speech , 2007, IEEE Signal Processing Letters.

[13]  Martin J. Russell,et al.  Challenges for computer recognition of children2s speech , 2007, SLaTE.

[14]  Thorsten Brants,et al.  Large Language Models in Machine Translation , 2007, EMNLP.

[15]  Shweta Ghai,et al.  Exploring the role of spectral smoothing in context of children's speech recognition , 2009, INTERSPEECH.

[16]  Shrikanth S. Narayanan,et al.  A review of ASR technologies for children's speech , 2009, WOCCI.

[17]  Cyril Allauzen,et al.  Bayesian Language Model Interpolation for Mobile Speech Input , 2011, INTERSPEECH.

[18]  Marc'Aurelio Ranzato,et al.  Large Scale Distributed Deep Networks , 2012, NIPS.

[19]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Tara N. Sainath,et al.  Deep convolutional neural networks for LVCSR , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[21]  Ebru Arisoy,et al.  Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22]  Tara N. Sainath,et al.  Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[23]  Andrew W. Senior,et al.  Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.

[24]  Sanjeev Khudanpur,et al.  A pitch extraction algorithm tuned for automatic speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Diego Giuliani,et al.  Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[26]  Isabel Trancoso,et al.  Correlating ASR errors with developmental changes in speech production: a study of 3-10-year-old European Portuguese children's speech , 2014, WOCCI.

[27]  Jianhua Lu,et al.  Child automatic speech recognition for US English: child interaction with living-room-electronic-devices , 2014, WOCCI.

[28]  Georg Heigold,et al.  Asynchronous stochastic optimization for sequence training of deep neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  Shrikanth S. Narayanan,et al.  Improving speech recognition for children using acoustic adaptation and pronunciation modeling , 2014, WOCCI.

[30]  Tara N. Sainath,et al.  Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).