Estimation of unknown speaker’s height from speech

In the present study, we propose a regression-based scheme for directly estimating the height of unknown speakers from their speech. In this scheme, every speech input is converted by the openSMILE audio parameterization into a single feature vector, which is fed to a regression model that outputs a direct estimate of the speaker's height. The study focuses on evaluating how well several linear and non-linear regression algorithms suit the task of automatic height estimation from speech. The performance of the proposed scheme is evaluated on the TIMIT database. The best-performing algorithm, Bagging regression, achieves a mean absolute error of 0.053 meters, which corresponds to an average relative error of approximately 3%. We consider that direct estimation of the height of unknown people from speech provides an important additional feature for improving the performance of various surveillance, profiling and access authorization applications.
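The two error measures quoted above can be related with a short calculation. The sketch below is only an illustration: it assumes the relative error is the absolute error divided by the true height, averaged over speakers, which the abstract does not state explicitly. The function names and the height values are hypothetical and are not data from the study.

```python
def mean_absolute_error(true_heights, predicted_heights):
    """Average of |true - predicted|, in the same unit as the inputs (meters)."""
    pairs = list(zip(true_heights, predicted_heights))
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


def mean_relative_error(true_heights, predicted_heights):
    """Average of |true - predicted| / true, as a fraction (assumed definition)."""
    pairs = list(zip(true_heights, predicted_heights))
    return sum(abs(t - p) / t for t, p in pairs) / len(pairs)


# Hypothetical values for illustration only; not data from the study.
true_heights = [1.60, 1.75, 1.90]
predicted_heights = [1.65, 1.70, 1.96]

mae = mean_absolute_error(true_heights, predicted_heights)
mre = mean_relative_error(true_heights, predicted_heights)

print(f"MAE: {mae:.3f} m")
print(f"Mean relative error: {100 * mre:.1f}%")
```

With heights of this magnitude, an absolute error of about 0.05 m works out to a relative error of about 3%, which is consistent with the figures reported in the abstract.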
