Improving children's mismatched ASR using structured low-rank feature projection

Abstract

The work presented in this paper explores the issues that arise in automatic speech recognition (ASR) of children's speech using acoustic models trained on adults' speech. In such contexts, the large acoustic mismatch between the training and test data leads to severely degraded recognition rates. Even with vocal tract length normalization (VTLN), recognition performance in the mismatched case remains well below that of the matched case. Our earlier studies have shown that, for the commonly used mel-filterbank-based cepstral features, the acoustic mismatch is exacerbated by insufficient smoothing of pitch harmonics for child speakers. To address this problem, this paper proposes a structured low-rank projection of the feature vectors prior to learning the acoustic models as well as before decoding. To accomplish this, a low-rank transform is first learned on the training data (adults' speech). Any dimensionality reduction technique that depends on the variance of the training data may be used for this purpose; in this work, principal component analysis (PCA) and heteroscedastic linear discriminant analysis (HLDA) are explored. When the derived low-rank projection is applied in the mismatched testing case, it alleviates the pitch-dependent mismatch. The proposed approach provides a relative improvement in recognition performance of 35% over the VTLN-included baseline for children's mismatched ASR with acoustic models based on hidden Markov models (HMM) whose observation densities are modeled by Gaussian mixture models (GMM). In addition, acoustic modeling approaches based on subspace GMMs (SGMM) and deep neural networks (DNN) have also been explored; projecting the data onto a lower-dimensional subspace is found to be effective in those frameworks as well. For the SGMM- and DNN-based systems, the proposed approach yields relative recognition performance improvements of 33% and 21%, respectively, over their corresponding baselines.
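To make the PCA variant of the described pipeline concrete, the following is a minimal sketch of learning a low-rank projection on adult training features and applying the same projection to mismatched (children's) test features before decoding. It is illustrative only: the function names, the 39-dimensional MFCC assumption, and the retained dimensionality of 26 are assumptions for the example, not values taken from the paper.

```python
import numpy as np

def learn_pca_projection(train_feats, keep_dims):
    """Learn a low-rank PCA projection from adult training feature vectors.

    train_feats : (num_frames, feat_dim) array of cepstral feature vectors
    keep_dims   : number of leading principal components to retain
    """
    mean = train_feats.mean(axis=0)
    centered = train_feats - mean
    # Eigen-decomposition of the training-data covariance matrix
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; keep the top `keep_dims`
    # directions of largest variance
    order = np.argsort(eigvals)[::-1][:keep_dims]
    projection = eigvecs[:, order]            # shape: (feat_dim, keep_dims)
    return mean, projection

def apply_projection(feats, mean, projection):
    """Project feature vectors (training or mismatched test data) onto the subspace."""
    return (feats - mean) @ projection

# Hypothetical usage: 39-dim MFCC(+deltas) features reduced to 26 dimensions,
# once before acoustic-model training and again before decoding children's speech.
# `adult_train_mfcc` and `child_test_mfcc` are placeholder arrays.
adult_train_mfcc = np.random.randn(10000, 39)
child_test_mfcc = np.random.randn(2000, 39)
mean, P = learn_pca_projection(adult_train_mfcc, keep_dims=26)
adult_train_lowrank = apply_projection(adult_train_mfcc, mean, P)
child_test_lowrank = apply_projection(child_test_mfcc, mean, P)
```

The key point the sketch illustrates is that the projection is estimated solely from the adult training data and then reused unchanged on the children's test data, so the directions dominated by pitch-harmonic variability can be attenuated consistently on both sides. An HLDA-based transform could be substituted for PCA in the same position of the pipeline, but it additionally requires class (state) labels and is not shown here.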
