A Fast Adaptation Approach for Enhanced Automatic Recognition of Children’s Speech with Mismatched Acoustic Models

This study explores issues in the automatic speech recognition (ASR) of children's speech using acoustic models trained on adults' speech. For acoustic modeling in ASR, the employed front-end features capture the characteristics of the vocal tract filter while smoothing out those of the excitation source. Adults' and children's speech differ significantly due to large deviations in acoustic correlates such as pitch, formants, and speaking rate. In the context of children's speech recognition on mismatched acoustic models, recognition rates remain highly degraded even when vocal tract length normalization (VTLN) is applied to address the formant mismatch. For the commonly used mel-filterbank-based cepstral features, earlier studies have shown that the acoustic mismatch is exacerbated by insufficient smoothing of the pitch harmonics of child speakers. To address this problem, an earlier work explored a structured low-rank projection of the test features as well as of the mean and covariance parameters of the acoustic models. In this paper, a low-latency adaptation scheme is presented for children's mismatched ASR. The presented fast adaptation approach exploits the previously reported low-rank projection technique in order to reduce the computational cost. In the proposed approach, development data from the children's domain is partitioned into separate groups on the basis of the estimated VTLN warp factors. A set of adapted acoustic models is then created by combining the low-rank projection with model-space adaptation for each of the warp factors. Given a children's test utterance, an appropriate pre-adapted model mean supervector is first chosen based on the utterance's estimated warp factor. The chosen supervector is then optimally scaled. Consequently, only two parameters need to be estimated per utterance: a warp factor and a model mean scaling factor. Even with such stringent constraints, the proposed adaptation technique yields a relative improvement of about 44% over a baseline that already includes VTLN.
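The two-parameter test-time procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the warp-factor grid, the toy supervectors, and the least-squares estimate of the scaling factor (standing in for the maximum-likelihood estimate used in the paper) are all illustrative assumptions.

```python
import numpy as np

def select_supervector(warp_factor, warp_grid, supervectors):
    """Pick the pre-adapted mean supervector whose VTLN warp factor
    is closest to the test utterance's estimated warp factor."""
    idx = int(np.argmin(np.abs(np.asarray(warp_grid) - warp_factor)))
    return supervectors[idx]

def scale_supervector(mu, stats):
    """Estimate the scalar a minimizing ||stats - a * mu||^2 in closed
    form (a least-squares stand-in for the ML scaling estimate), and
    return both a and the scaled supervector."""
    a = float(np.dot(mu, stats) / np.dot(mu, mu))
    return a, a * mu

# Toy usage with made-up numbers:
warp_grid = [0.80, 0.88, 0.96, 1.04]               # warp factors of the pre-adapted models
supervectors = [np.full(4, 1.0 + 0.5 * i) for i in range(4)]

mu = select_supervector(0.85, warp_grid, supervectors)   # nearest grid point is 0.88
a, mu_adapted = scale_supervector(mu, 2.0 * mu)          # synthetic stats; best fit a = 2.0
```

Because only the warp factor and the scaling factor are estimated per utterance, the search at test time reduces to a nearest-neighbor lookup followed by a one-dimensional optimization, which is what keeps the latency low.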
