Non-linear frequency warping for VTLN using subglottal resonances and the third formant frequency

This paper proposes a non-linear frequency warping scheme for vocal tract length normalization (VTLN). It is based on mapping the subglottal resonances (SGRs) and the third formant frequency (F3) of a given utterance to those of a reference speaker. SGRs are used because they relate to formants in specific ways while remaining phonetically invariant, and F3 is used because it is somewhat correlated with vocal-tract length. Given an utterance, the warping parameters (SGRs and F3) are determined by obtaining initial estimates from the signal and then refining those estimates with respect to a speaker-independent model. For children (TIDIGITS), the proposed method yields statistically significant word error rate (WER) reductions (up to 15%) relative to conventional VTLN (linear warping) when (1) speakers show poor baseline performance, and/or (2) training data are limited. For adults (Wall Street Journal), the WER reduction relative to conventional VTLN is 4-5%. A comparison with other non-linear warping techniques is also reported.
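
The abstract does not state the functional form of the warping, so the following is only an illustrative sketch of the general idea of mapping a speaker's anchor frequencies onto a reference speaker's. It assumes a piecewise-linear warp that passes through the anchor points, with 0 Hz and the Nyquist frequency held fixed. The function name `piecewise_linear_warp`, the choice of two SGRs plus F3 as anchors, and all numeric values are hypothetical and are not taken from the paper.

```python
import numpy as np


def piecewise_linear_warp(freqs_hz, speaker_anchors_hz, reference_anchors_hz, nyquist_hz):
    """Warp frequencies so each speaker anchor lands on its reference anchor.

    Assumption: linear interpolation between anchors; 0 Hz and Nyquist are fixed.
    """
    src = np.concatenate(([0.0], np.asarray(speaker_anchors_hz, dtype=float), [nyquist_hz]))
    dst = np.concatenate(([0.0], np.asarray(reference_anchors_hz, dtype=float), [nyquist_hz]))
    # Both anchor sets must be strictly increasing for the warp to be monotonic.
    if np.any(np.diff(src) <= 0) or np.any(np.diff(dst) <= 0):
        raise ValueError("anchors must be strictly increasing and lie between 0 and Nyquist")
    return np.interp(freqs_hz, src, dst)


# Hypothetical anchor values in Hz (two SGRs and F3); not measurements from the paper.
speaker_anchors = [800.0, 1900.0, 3300.0]
reference_anchors = [600.0, 1400.0, 2500.0]

test_freqs = np.array([0.0, 800.0, 1350.0, 1900.0, 3300.0, 8000.0])
warped = piecewise_linear_warp(test_freqs, speaker_anchors, reference_anchors, nyquist_hz=8000.0)
print(warped)
```

Each anchor of the test speaker maps exactly onto the corresponding reference anchor, and frequencies between anchors are scaled by a different slope in each segment, which is what makes the warp non-linear overall. Conventional linear VTLN, by contrast, applies a single scaling factor across the band.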
