Automatic estimation of the first two subglottal resonances in children's speech with application to speaker normalization in limited-data conditions

This paper proposes an automatic algorithm for estimating the first two subglottal resonances (SGRs), Sg1 and Sg2, from continuous speech of children, and applies it to automatic speaker normalization in mismatched, limited-data conditions. The proposed algorithm is based on the observation that Sg1 and Sg2 form boundaries between phonological vowel feature classes, and is motivated by our recent SGR estimation algorithm for adults. The algorithm is trained on speech from 25 children and evaluated on speech from 9 children, all aged between 7 and 18 years. The average RMS errors incurred in estimating Sg1 and Sg2 are 55 and 144 Hz, respectively. By applying the proposed algorithm to a connected-digits speech recognition task, it is shown that: 1) a linear frequency warping using Sg1 or Sg2 is comparable to or better than maximum likelihood-based vocal tract length normalization (ML-VTLN), 2) the performance of SGR-based frequency warping is less content dependent than that of ML-VTLN, and 3) SGR-based frequency warping can be integrated into ML-VTLN to yield a statistically significant improvement in performance.

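To make the normalization step concrete, the sketch below illustrates one plausible form of the linear frequency warping described above: a speaker-specific warp factor is derived from the ratio between a reference SGR value and the speaker's estimated SGR, and the filterbank center frequencies used during feature extraction are scaled by that factor. The function names, the reference value, and the direction of the ratio are illustrative assumptions for this sketch, not the paper's exact implementation.

import numpy as np

def sgr_warp_factor(sgr_speaker_hz: float, sgr_reference_hz: float) -> float:
    # Assumed convention: the warp factor maps the speaker's frequency axis
    # toward the reference axis (alpha < 1 compresses, alpha > 1 stretches).
    # Some VTLN setups apply 1/alpha to the spectrum instead of alpha to the filterbank.
    return sgr_reference_hz / sgr_speaker_hz

def warp_frequencies(freqs_hz, alpha: float):
    # Linear warping f -> alpha * f, applied here to mel-filterbank
    # center frequencies before computing cepstral features.
    return alpha * np.asarray(freqs_hz, dtype=float)

# Hypothetical example: a child's estimated Sg2 against an assumed adult-like reference.
alpha = sgr_warp_factor(sgr_speaker_hz=1800.0, sgr_reference_hz=1400.0)
filterbank_centers_hz = np.linspace(100.0, 3800.0, 23)  # illustrative center frequencies
warped_centers_hz = warp_frequencies(filterbank_centers_hz, alpha)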