论文信息 - Large vocabulary automatic speech recognition for children

Large vocabulary automatic speech recognition for children

Recently, Google launched YouTube Kids, a mobile application for children, that uses a speech recognizer built specifically for recognizing children’s speech. In this paper we present techniques we explored to build such a system. We describe the use of a neural network classifier to identify matched acoustic training data, filtering data for language modeling to reduce the chance of producing offensive results. We also compare long short-term memory (LSTM) recurrent networks to convolutional, LSTM, deep neural networks (CLDNN). We found that a CLDNN acoustic model outperforms an LSTM across a variety of different conditions, but does not specifically model child speech relatively better than adult. Overall, these findings allow us to build a successful, state-of-the-art large vocabulary speech recognizer for both children and adults.

[1] E. A. Martin,et al. Multi-style training for robust isolated-word speech recognition , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2] Michael Picheny,et al. Context Dependent Modeling of Phones in Continuous Speech Using Decision Trees , 1991, HLT.

[3] S. J. Young,et al. Tree-based state tying for high accuracy acoustic modelling , 1994 .

[4] Herbert Gish,et al. A parametric approach to vocal tract length normalization , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[5] S. Wegmann,et al. Speaker normalization on conversational telephone speech , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[6] Shrikanth S. Narayanan,et al. Automatic speech recognition for children , 1997, EUROSPEECH.

[7] Mark J. F. Gales,et al. Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[8] Li Lee,et al. A frequency warping approach to speaker normalization , 1998, IEEE Trans. Speech Audio Process..

[9] Hideki Kawahara,et al. YIN, a fundamental frequency estimator for speech and music. , 2002, The Journal of the Acoustical Society of America.

[10] Kevyn Collins-Thompson,et al. A Language Modeling Approach to Predicting Reading Difficulty , 2004, NAACL.

[11] Daniel Elenius,et al. Adaptation and normalization experiments in speech recognition for 4 to 8 year old children , 2005, INTERSPEECH.

[12] Li Qun,et al. The Effects of Bandwidth Reduction on Human and Computer Recognition of Children's Speech , 2007, IEEE Signal Processing Letters.

[13] Martin J. Russell,et al. Challenges for computer recognition of children2s speech , 2007, SLaTE.

[14] Thorsten Brants,et al. Large Language Models in Machine Translation , 2007, EMNLP.

[15] Shweta Ghai,et al. Exploring the role of spectral smoothing in context of children's speech recognition , 2009, INTERSPEECH.

[16] Shrikanth S. Narayanan,et al. A review of ASR technologies for children's speech , 2009, WOCCI.

[17] Cyril Allauzen,et al. Bayesian Language Model Interpolation for Mobile Speech Input , 2011, INTERSPEECH.

[18] Marc'Aurelio Ranzato,et al. Large Scale Distributed Deep Networks , 2012, NIPS.

[19] Gerald Penn,et al. Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20] Tara N. Sainath,et al. Deep convolutional neural networks for LVCSR , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[21] Ebru Arisoy,et al. Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22] Tara N. Sainath,et al. Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[23] Andrew W. Senior,et al. Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.

[24] Sanjeev Khudanpur,et al. A pitch extraction algorithm tuned for automatic speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25] Diego Giuliani,et al. Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[26] Isabel Trancoso,et al. Correlating ASR errors with developmental changes in speech production: a study of 3-10-year-old European Portuguese children's speech , 2014, WOCCI.

[27] Jianhua Lu,et al. Child automatic speech recognition for US English: child interaction with living-room-electronic-devices , 2014, WOCCI.

[28] Georg Heigold,et al. Asynchronous stochastic optimization for sequence training of deep neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29] Shrikanth S. Narayanan,et al. Improving speech recognition for children using acoustic adaptation and pronunciation modeling , 2014, WOCCI.

[30] Tara N. Sainath,et al. Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).