Speaker age estimation on conversational telephone speech using senone posterior based i-vectors

Automatic age estimation from speech has a variety of applications, including natural human-computer interaction, targeted advertising, customer-agent pairing in call centers, and forensics. Recently, i-vectors have shown promise for automatic age estimation. In this paper, we adopt a phonetically-aware i-vector extractor for the age estimation problem. Such senone i-vector based schemes have demonstrated success in the speaker recognition field. The fixed-length, low-dimensional i-vectors are first conditioned with a linear discriminant analysis (LDA) transform and then used to train a support vector regression (SVR) model. Additionally, in contrast to previous work, we use the logarithm of the age as the SVR training target, which penalizes estimation errors for younger speakers more heavily than the same errors for older speakers. The proposed system is evaluated using telephony speech material extracted from the NIST SRE 2008 and 2010 evaluation corpora. Experimental results show a mean absolute error (MAE) of 4.7 years for both male and female speakers on the NIST SRE 2010 telephony test set.
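The effect of the log-age target and the MAE metric can be illustrated with a minimal sketch. It assumes a natural logarithm (the abstract does not state the base), the function names are hypothetical, and the ages are made-up illustrative values rather than data from the paper. The SVR itself is omitted; only the target transform and the metric are shown.

```python
import math

def to_log_target(age_years):
    """Map an age in years to a log-domain regression target (natural log assumed)."""
    return math.log(age_years)

def from_log_prediction(log_age):
    """Map a log-domain prediction back to an age in years."""
    return math.exp(log_age)

def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in years, between true and predicted ages."""
    errors = [abs(t - p) for t, p in zip(true_ages, predicted_ages)]
    return sum(errors) / len(errors)

# The same 5-year error is a larger deviation in the log domain
# for a younger speaker than for an older one (illustrative ages).
young_cost = abs(to_log_target(25) - to_log_target(20))
old_cost = abs(to_log_target(65) - to_log_target(60))

# MAE is computed in years, after mapping predictions back from the log domain.
true_ages = [20, 60]
predicted_ages = [from_log_prediction(to_log_target(25)),
                  from_log_prediction(to_log_target(55))]
mae_years = mean_absolute_error(true_ages, predicted_ages)
```

Because a regressor trained on log-age minimizes deviations in the log domain, a fixed error in years weighs more for a 20-year-old than for a 60-year-old, which matches the motivation stated above.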
