Enhancing the recognition of children's speech on acoustically mismatched ASR system

This paper explores the problem of recognizing children's speech using acoustic models trained on adults' speech data. Under such conditions, the large acoustic mismatch between training and test data causes a severe degradation in recognition performance. In our earlier work, a binary weighting of cepstral features as well as of acoustic model parameters was explored to address this mismatch. In this paper, a soft-weighting (SW) scheme is proposed to overcome the information loss incurred by the simple binary weighting scheme. This is achieved through a low-rank projection learned from adults' training data. The transform so derived emphasizes the principal dimensions of acoustic variation in adults' speech. During testing, the transform maps children's test data to the space of the training data and thus suppresses the mismatched dimensions. The proposed scheme is verified experimentally using a recognition system trained on adults' data only, as well as another system trained on adults' and children's data pooled together. Acoustic model adaptation is also explored to further enhance system performance. Combining SW with cluster model interpolation leads to a relative improvement of 14% over the baseline.
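
The idea of a low-rank projection learned on adults' data and applied to children's test features can be illustrated with a minimal sketch. It assumes the projection is obtained by principal component analysis of the training features, which the abstract does not state; the function names, the feature dimension, and the rank are hypothetical choices made only for illustration.

```python
import numpy as np


def learn_low_rank_projection(train_feats, rank):
    """Learn a rank-limited projector from training (adult) features.

    ASSUMPTION: PCA via SVD stands in for the paper's transform; the
    actual estimation procedure is not given in the abstract.
    train_feats has shape (num_frames, feat_dim).
    """
    mean = train_feats.mean(axis=0)
    centered = train_feats - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank].T                # (feat_dim, rank)
    projector = basis @ basis.T        # (feat_dim, feat_dim), rank-limited
    return mean, projector


def apply_projection(feats, mean, projector):
    """Map features onto the subspace spanned by the training data's
    leading directions, discarding variation outside that subspace."""
    return (feats - mean) @ projector + mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for cepstral features (hypothetical 13-dim vectors).
    adult_train = rng.normal(size=(500, 13))
    child_test = rng.normal(loc=0.5, size=(20, 13))

    mean, projector = learn_low_rank_projection(adult_train, rank=8)
    mapped = apply_projection(child_test, mean, projector)
    print(mapped.shape)
```

Because the projector keeps the feature dimension unchanged, the mapped test features can be passed to the existing acoustic models without retraining. A hard rank cut-off as used here is only the simplest variant; a graded weighting of directions would be closer in spirit to a soft scheme.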
