Multitask speaker profiling for estimating age, height, weight and smoking habits from spontaneous telephone speech signals

This paper proposes a novel approach for automatically estimating four speaker traits, namely age, height, weight and smoking habit, from speech signals. Each utterance is modeled using the i-vector framework, which is based on factor analysis of Gaussian Mixture Model (GMM) mean supervectors, and the Non-negative Factor Analysis (NFA) framework, which is based on a constrained factor analysis of GMM weights. Artificial Neural Networks (ANNs) and Least Squares Support Vector Regression (LSSVR) are then employed to estimate the age, height and weight of speakers from given utterances, and ANNs and logistic regression (LR) are used to detect smoking habit. Since GMM weights provide information complementary to GMM means, a score-level fusion of the i-vector-based and NFA-based recognizers is considered for the age and smoking habit estimation tasks to improve performance. In addition, a multitask speaker profiling approach is proposed to evaluate the correlated tasks simultaneously and in interaction with each other, and consequently to boost the accuracy of speaker age, height, weight and smoking habit estimation. To this end, a hybrid architecture involving the score-level fusion of the i-vector-based and NFA-based recognizers is proposed to exploit the information available in both Gaussian means and Gaussian weights. ANNs are then employed to share the learned information among all tasks while they are learned in parallel. The proposed method is evaluated on telephone speech signals from the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation (SRE) corpora. Experimental results over 1194 utterances show the effectiveness of the proposed method in automatic speaker profiling.
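The score-level fusion mentioned above can be illustrated with a minimal sketch. The abstract does not state which fusion rule is used, so the linear least-squares combination below is an assumption, and the function names (`fit_fusion_weights`, `fuse`, `rmse`) and the synthetic "age" scores are hypothetical stand-ins for the outputs of an i-vector-based and an NFA-based recognizer.

```python
import numpy as np


def fit_fusion_weights(scores_a, scores_b, targets):
    """Fit [w_a, w_b, bias] by least squares (assumed fusion rule, not the paper's)."""
    design = np.column_stack([scores_a, scores_b, np.ones(len(targets))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights


def fuse(scores_a, scores_b, weights):
    """Combine two score streams into one fused estimate."""
    return weights[0] * scores_a + weights[1] * scores_b + weights[2]


def rmse(pred, ref):
    """Root-mean-square error between predictions and reference values."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))


# Synthetic data for illustration only: two noisy estimates of the same age.
rng = np.random.default_rng(0)
true_age = rng.uniform(20.0, 70.0, size=200)
ivector_scores = true_age + rng.normal(0.0, 8.0, size=200)  # stand-in for i-vector system
nfa_scores = true_age + rng.normal(0.0, 10.0, size=200)     # stand-in for NFA system

weights = fit_fusion_weights(ivector_scores, nfa_scores, true_age)
fused = fuse(ivector_scores, nfa_scores, weights)

print("i-vector RMSE:", rmse(ivector_scores, true_age))
print("NFA RMSE:     ", rmse(nfa_scores, true_age))
print("fused RMSE:   ", rmse(fused, true_age))
```

On the data used to fit the weights, the fused estimate cannot have a higher squared error than either input, because each input alone is one of the affine combinations the least-squares fit searches over. In practice the weights would be fitted on a held-out development set and applied to unseen test scores.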
