Improving children speech recognition in acoustically mismatched condition using eigenvoices and feature projections

The automatic recognition of children's speech in acoustically mismatched conditions is a challenging problem because of the large differences between adults' and children's speech. In the literature, this challenge is often addressed by combining feature-domain and model-domain adaptation methods such as vocal tract length normalization (VTLN), maximum likelihood linear regression (MLLR) and heteroscedastic linear discriminant analysis (HLDA). Even so, a significant gap between recognition performance for adults and for children remains. This work explores eigenvoice (EV) based adaptation to narrow that gap when children's speech is recognized using acoustic models trained on adults' speech. EV is a fast adaptation approach and enables effective gender biasing of the acoustic models. On combining EV with VTLN, MLLR and HLDA under mismatched conditions, an absolute improvement of about 50% over the unadapted speaker-independent system is obtained, which significantly reduces the gap between the performance for adults and children.
