Analyzing pitch robustness of PMVDR and MFCC features for children's speech recognition

The degradation of children's speech recognition performance under mismatched conditions, i.e., on models trained on adults' speech, is a well-known problem. Among several other factors, the large difference between the pitch values of adults' and children's speech contributes to this degradation. MFCC is the most commonly used feature in automatic speech recognition, but it has been reported to be affected by pitch variations across speech signals. Recently, the perceptual-MVDR (PMVDR) feature has been reported as a better alternative to MFCC under noisy conditions. It has also been credited with better spectral modeling ability for high-pitched signals. Motivated by these findings, in this work we analyze the robustness of PMVDR to pitch variations across speech signals, in comparison with MFCC, for children's speech recognition under mismatched conditions. Our study finds PMVDR to be more pitch-robust than MFCC when the default parameters are used. However, when the parameters are suitably adapted for children's speech recognition under mismatched conditions, both PMVDR and MFCC give significantly improved and comparable performance on children's speech, and both exhibit similar robustness to pitch variations.
