Pitch adaptive MFCC features for improving children’s mismatched ASR

A pitch normalization algorithm is proposed to address the pitch mismatch between adults’ and children’s speech in children’s automatic speech recognition (ASR). Motivated by the pitch-dependent distortions that appear in the smoothed mel spectral envelope of high-pitched children’s speech, the algorithm modifies the mel filterbank during MFCC feature extraction to improve ASR performance. Relative improvements of 16% and 9% over the corresponding baselines are obtained in children’s mismatched ASR performance on a connected-digit recognition task and a continuous speech recognition task, respectively. The improvements obtained with the proposed pitch normalization algorithm are also found to be additive to those obtained with the existing speaker normalization techniques VTLN and CMLLR.
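To make concrete what "modifying the mel filterbank" can mean, the following is a minimal sketch of a conventional triangular mel filterbank with one adjustable parameter. The abstract does not state how the filterbank is modified, so the `bandwidth_scale` knob, the function names, and the default values (21 filters, 512-point FFT, 16 kHz sampling) are all assumptions made for illustration and should not be read as the proposed algorithm.

```python
import numpy as np


def hz_to_mel(f):
    """Convert frequency in Hz to mel (common 2595*log10 formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_filterbank(n_filters=21, n_fft=512, sample_rate=16000,
                   bandwidth_scale=1.0):
    """Triangular mel filterbank of shape (n_filters, n_fft // 2 + 1).

    bandwidth_scale is a HYPOTHETICAL knob (not taken from the paper):
    1.0 gives the usual 50%-overlapping triangles; values above 1.0
    widen every triangle around its centre on the mel axis.
    """
    # Frequencies of the FFT bins, expressed on the mel axis.
    bin_hz = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bin_mel = hz_to_mel(bin_hz)

    # Filter centres are equally spaced on the mel axis.
    edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    centres = edges[1:-1]
    half_width = (edges[1] - edges[0]) * bandwidth_scale

    # Triangle: weight 1 at the centre, falling linearly to 0 at half_width.
    distance = np.abs(bin_mel[None, :] - centres[:, None])
    return np.maximum(0.0, 1.0 - distance / half_width)


# Usage: compare the default filterbank with a widened (illustrative) one.
base = mel_filterbank()
wide = mel_filterbank(bandwidth_scale=1.5)
print(base.shape, wide.shape)
```

In this sketch a larger `bandwidth_scale` makes each filter average over a wider frequency range, which is one simple way a filterbank could be made to smooth the spectrum more heavily. Whether and how the proposed algorithm ties such a change to the estimated pitch is not described in the abstract.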
