Experiments on children's speech recognition under acoustically mismatched conditions

This paper examines how well several existing acoustic features perform when children's speech is recognized using acoustic models trained on adults' speech. Two of the features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction cepstral coefficients (PLPCC), are among the most commonly used in speech recognition. The third is based on normalized first-order spectral moments (SMAC-features), which have not previously been explored for this kind of mismatched ASR task. Because of the large acoustic mismatch between the training and test data, recognition performance degrades severely for all three features. Even so, the SMAC-features outperform the other two, as verified experimentally under both clean and noisy test conditions. To address the acoustic mismatch, a low-rank projection is learned on the adults' training data using heteroscedastic linear discriminant analysis (HLDA). The derived transform emphasizes the principal dimensions of acoustic variation in the adults' speech. The projection is applied to both the training and test data; during testing, it maps the children's test data into the space of the training data and thereby alleviates the acoustic mismatch. Applied to the SMAC-features, the low-rank projection yields significant improvements in recognition performance.
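
The train-on-adults, apply-to-both workflow described above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: classical (homoscedastic) linear discriminant analysis is swapped in as a simpler stand-in for heteroscedastic linear discriminant analysis, the features are synthetic Gaussian data rather than SMAC-features, and the function name `learn_lda_projection`, the class count, and the dimensions are all hypothetical.

```python
import numpy as np


def learn_lda_projection(features, labels, rank):
    """Learn a low-rank projection with classical LDA.

    NOTE: this is a homoscedastic simplification used for illustration,
    not the heteroscedastic variant (HLDA) named in the abstract.
    """
    dim = features.shape[1]
    global_mean = features.mean(axis=0)
    within = np.zeros((dim, dim))
    between = np.zeros((dim, dim))
    for c in np.unique(labels):
        x = features[labels == c]
        mu = x.mean(axis=0)
        centered = x - mu
        within += centered.T @ centered
        diff = (mu - global_mean)[:, None]
        between += len(x) * (diff @ diff.T)
    within /= len(features)
    between /= len(features)

    # Whiten the within-class scatter, then diagonalize the between-class
    # scatter in the whitened space (a symmetric eigenproblem).
    chol = np.linalg.cholesky(within + 1e-8 * np.eye(dim))
    chol_inv = np.linalg.inv(chol)
    eigvals, eigvecs = np.linalg.eigh(chol_inv @ between @ chol_inv.T)
    order = np.argsort(eigvals)[::-1][:rank]  # keep the top `rank` directions
    return chol_inv.T @ eigvecs[:, order]     # shape: (dim, rank)


# Synthetic stand-ins for adult (training) and child (test) feature frames.
rng = np.random.default_rng(0)
num_classes, dim, rank = 5, 13, 4
class_means = rng.normal(scale=3.0, size=(num_classes, dim))

adult_labels = rng.integers(0, num_classes, size=2000)
adult_feats = class_means[adult_labels] + rng.normal(size=(2000, dim))

# Child data: same classes, but shifted and noisier to mimic a mismatch.
child_labels = rng.integers(0, num_classes, size=500)
child_feats = class_means[child_labels] + rng.normal(scale=1.5, size=(500, dim)) + 0.5

# The projection is estimated on adult data only ...
projection = learn_lda_projection(adult_feats, adult_labels, rank)

# ... and then applied unchanged to both training and test features.
adult_low = adult_feats @ projection
child_low = child_feats @ projection
print(adult_low.shape, child_low.shape)  # (2000, 4) (500, 4)
```

The point of the sketch is only the data flow: no child data is used to estimate the transform, so any reduction in mismatch comes from discarding dimensions that carry little class-discriminative variation in the adult data.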
