Acoustic variability and automatic recognition of children's speech

This paper presents several acoustic analyses of read speech collected from Italian children aged 7 to 13 and North American children aged 5 to 17. The analyses aim at a better understanding of the spectral and temporal changes in speech produced by children of various ages, with a view to developing automatic speech recognition applications. The results confirm and complement those reported in the literature: the characteristics of children's speech change with age, and spectral and temporal variability decreases as age increases. In particular, younger children show substantially higher intra- and inter-speaker variability than older children and adults. We also investigated several methods for speaker adaptive acoustic modeling to cope with inter-speaker spectral variability and to improve recognition performance for children. These methods proved effective in recognizing read speech with a vocabulary of about 11k words.
