Multi-System Fusion of Extended Context Prosodic and Cepstral Features for Paralinguistic Speaker Trait Classification

As automatic speech processing has matured, research attention has expanded to paralinguistic problems, which aim to detect information beyond the words themselves. This paper focuses on classifying the seven speaker trait categories of the Interspeech 2012 Speaker Trait Challenge: likeability, intelligibility, openness, conscientiousness, extraversion, agreeableness, and neuroticism. Our approach combines multiple feature types: prosodic, cepstral, shifted-delta cepstral, and a reduced set of the OpenSMILE features. The classifiers include GMM-UBM, eigenchannel, support vector machine, and distance-based models. Optimized feature reduction, together with logistic-regression-based score calibration and fusion, yielded results that are competitive with the challenge baseline in all categories.
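As an illustration of logistic-regression-based score fusion in general, the sketch below learns one weight per subsystem plus a bias so that the weighted sum of subsystem scores behaves like a calibrated log-odds score. It is not the paper's implementation: the function names, the plain gradient-descent fit, the learning rate, and the two synthetic "subsystems" are all assumptions made for the example.

```python
import numpy as np

def fit_fusion(scores, labels, lr=0.1, n_iter=2000):
    """Fit fusion weights w and bias b by gradient descent on the logistic loss.

    scores: (n_trials, n_systems) array of raw subsystem scores.
    labels: (n_trials,) array of 0/1 class labels.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n, k = scores.shape
    w = np.zeros(k)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))  # predicted P(label = 1)
        err = p - labels                              # gradient w.r.t. the log-odds
        w -= lr * (scores.T @ err) / n
        b -= lr * err.mean()
    return w, b

def fuse(scores, w, b):
    """Return fused log-odds scores: a weighted sum of subsystem scores plus bias."""
    return np.asarray(scores, dtype=float) @ w + b

def log_loss(log_odds, labels):
    """Mean logistic loss of log-odds scores against 0/1 labels."""
    signed = np.where(np.asarray(labels) == 1, log_odds, -log_odds)
    return float(np.mean(np.logaddexp(0.0, -signed)))

# Synthetic data standing in for two hypothetical subsystems (assumption):
# system A separates the classes better than system B.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
sign = 2 * labels - 1
sys_a = 1.0 * sign + rng.normal(0.0, 1.0, size=400)
sys_b = 0.5 * sign + rng.normal(0.0, 1.0, size=400)
scores = np.column_stack([sys_a, sys_b])

w, b = fit_fusion(scores, labels)
fused = fuse(scores, w, b)
```

In practice the fusion weights would be fit on held-out development scores rather than on the data being scored, and a fused score above zero would be read as favoring the positive class.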
