Estimating Speaker Height and Subglottal Resonances Using MFCCs and GMMs

This letter investigates the use of Mel-frequency cepstral coefficients (MFCCs) and Gaussian mixture models (GMMs) for 1) improving the state of the art in speaker height estimation, and 2) rapid estimation of subglottal resonances (SGRs) without relying on formant and pitch tracking (unlike our previous algorithm in [1]). The proposed system comprises a set of height-dependent GMMs that model static and dynamic MFCC features, with each GMM associated with a height value. Furthermore, since SGRs and height are correlated, each GMM is also associated with a set of SGR values (known a priori). Given a speech sample, speaker height and SGRs are estimated as weighted combinations of the values corresponding to the N most likely GMMs. We assess the importance of using dynamic MFCC features and the weighted decision rule, and demonstrate the efficacy of our approach via experiments on height estimation (using TIMIT) and SGR estimation (using the Tracheal Resonance database).
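The weighted decision rule described above can be illustrated with a minimal sketch. The abstract does not state how the weights are computed, so the example assumes they are the normalized likelihoods of the N best-scoring GMMs. The function name `estimate_from_top_n` and its arguments are hypothetical and are not taken from the letter.

```python
import numpy as np

def estimate_from_top_n(log_likelihoods, values, n=3):
    """Combine the values tied to the n most likely models.

    log_likelihoods: one score per height-dependent GMM for the utterance.
    values: per-GMM target, either a height (1-D) or a row of SGRs (2-D).
    ASSUMPTION: weights are likelihoods normalized over the top-n models;
    the letter may use a different weighting.
    """
    ll = np.asarray(log_likelihoods, dtype=float)
    vals = np.asarray(values, dtype=float)
    top = np.argsort(ll)[-n:]               # indices of the n best models
    w = np.exp(ll[top] - ll[top].max())     # subtract max for numerical stability
    w /= w.sum()                            # weights sum to one
    return w @ vals[top]                    # weighted combination


# Toy usage with made-up scores and heights (cm), not data from the letter.
scores = [-10.0, -10.0, -100.0]
heights = [160.0, 170.0, 190.0]
print(estimate_from_top_n(scores, heights, n=2))
```

With n=1 the rule reduces to picking the value of the single most likely GMM, which gives a simple baseline against which a weighted rule can be compared.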

[1] Julio González, et al. Formant frequencies and body size of speaker: a weak relationship in adult humans, 2004, J. Phonetics.

[2] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[3] H. J. Künzel, et al. How Well Does Average Fundamental Frequency Correlate with Speaker Height and Weight?, 1989, Phonetica.

[4] J. Markel, et al. The SIFT algorithm for fundamental frequency estimation, 1972.

[5] Youngsook Jung, et al. Acoustic Articulatory Evidence for Quantal Vowel Categories: The Features (low) and (back), 2009.

[6] Abeer Alwan, et al. Automatic estimation of the first three subglottal resonances from adults' speech signals with application to speaker height estimation, 2013, Speech Commun.

[7] Kenneth N. Stevens, et al. On the quantal nature of speech, 1972.

[8] Sorin Dusan. Estimation of speaker's height and vocal tract length from speech signal, 2005, INTERSPEECH.

[9] Morgan Sonderegger. Subglottal coupling and vowel space: an investigation in quantal theory, 2004.

[10] Nikos Fakotakis, et al. Audio Features Selection for Automatic Height Estimation from Speech, 2010, SETN.

[11] John H. L. Hansen, et al. Voice Analysis in Adverse Conditions: The Centennial Olympic Park Bombing 911 Call, 1999.

[12] Abeer Alwan, et al. Non-linear frequency warping for VTLN using subglottal resonances and the third formant frequency, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13] Shiri Gordon, et al. An efficient image similarity measure based on approximations of KL-divergence between two gaussian mixtures, 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[14] Steven M. Lulich. Subglottal resonances and distinctive features, 2010, J. Phonetics.

[15] Abeer Alwan, et al. Subglottal resonances of adult male and female native speakers of American English, 2012, The Journal of the Acoustical Society of America.