A Study of Filter Bank Smoothing in MFCC Features for Recognition of Children's Speech

In this paper, we study the effect of filter bank smoothing on the recognition performance of children's speech. Filter bank smoothing of spectra is done during the computation of the Mel filter bank cepstral coefficients (MFCCs). We study the effect of smoothing both with and without vocal-tract length normalization (VTLN). The results from our experiments indicate that, unlike in conventional VTLN implementations, it is better not to scale the bandwidths of the filters during VTLN; only the filter center frequencies need be scaled. Our interpretation of this result is that while the formant center frequencies may approximately scale between speakers, the formant bandwidths do not change significantly. Therefore, scaling the filter bandwidths by a warp factor during conventional VTLN results in differences in spectral smoothing, leading to degradation in recognition performance. Similarly, results from our experiments indicate that for telephone-based speech without normalization, it is better to use uniform-bandwidth filters instead of the constant-Q-like filters that are used in the computation of conventional MFCCs. Our interpretation is that with constant-Q filters there is excessive spectral smoothing at higher frequencies, which leads to degradation in performance for children's speech. However, the use of constant-Q filters during VTLN does not create any additional performance degradation. As we will show, during VTLN it is only important that the filter bandwidths are not scaled, irrespective of whether we use constant-Q or uniform-bandwidth filters. With our proposed changes in the filter bank implementation, we get comparable performance for adults and about 6% improvement for children, both with and without VTLN, on a telephone-based digit recognition task.
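
The difference between the filter bank variants described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in the paper: the triangular filter shape, the common mel formula 2595 log10(1 + f/700), a purely linear warp by a factor `alpha`, and all function names are assumptions made here for illustration. Band-edge handling (edges falling below 0 Hz or above the Nyquist frequency after warping) is deliberately left out.

```python
import math

def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the paper may use another variant).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filters(n_filters, f_low, f_high):
    """Return (lower, center, upper) edges in Hz for mel-spaced triangular filters."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    pts = [mel_to_hz(m_low + i * step) for i in range(n_filters + 2)]
    return [(pts[i], pts[i + 1], pts[i + 2]) for i in range(n_filters)]

def warp_conventional(filters, alpha):
    """Conventional VTLN: every edge is scaled, so bandwidths scale by alpha too."""
    return [(alpha * lo, alpha * c, alpha * hi) for lo, c, hi in filters]

def warp_centers_only(filters, alpha):
    """Scale only the center; keep each filter's original distance to its edges."""
    return [(alpha * c - (c - lo), alpha * c, alpha * c + (hi - c))
            for lo, c, hi in filters]

def uniform_bandwidth(filters, bw_hz):
    """Keep mel-spaced centers but give every filter the same bandwidth in Hz."""
    return [(c - bw_hz / 2.0, c, c + bw_hz / 2.0) for _, c, _ in filters]

def bandwidths(filters):
    return [hi - lo for lo, _, hi in filters]

if __name__ == "__main__":
    # Hypothetical settings: 20 filters over a telephone-like band, warp factor 0.88.
    base = mel_filters(20, 100.0, 4000.0)
    conv = warp_conventional(base, 0.88)
    prop = warp_centers_only(base, 0.88)
    for b, c, p in zip(bandwidths(base), bandwidths(conv), bandwidths(prop)):
        print(f"unwarped {b:7.1f} Hz   conventional {c:7.1f} Hz   centers-only {p:7.1f} Hz")
```

Both warping functions place the filter centers at the same warped frequencies; they differ only in whether the amount of spectral smoothing (the bandwidth) changes with the warp factor.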
