Emotion Recognition using Acoustic and Lexical Features

In this paper we present an innovative approach to utterance-level emotion recognition that fuses acoustic features with lexical features extracted from automatic speech recognition (ASR) output. The acoustic features combine: (1) a novel set of features derived from segmental Mel-Frequency Cepstral Coefficients (MFCCs) scored against emotion-dependent Gaussian mixture models (GMMs), and (2) statistical functionals of low-level descriptors such as intensity, fundamental frequency, jitter, and shimmer. These acoustic features are fused with two types of lexical features extracted from the ASR output: (1) presence/absence of word stems, and (2) bag-of-words sentiment categories. The combined feature set is used to train support vector machine (SVM) classifiers for emotion classification. We demonstrate the efficacy of our approach by performing four-way emotion recognition on the University of Southern California's Interactive Emotional Motion Capture (USC-IEMOCAP) corpus. Our experiments show that the fusion of acoustic and lexical features delivers an emotion recognition accuracy of 65.7%, outperforming the best previously reported results on this challenging dataset.
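To make the fusion pipeline concrete, the following is a minimal Python sketch using scikit-learn with synthetic stand-ins for the real inputs (MFCC frames, prosodic functionals, and ASR-derived lexical indicators). All names, dimensions, and hyperparameters (e.g., the number of GMM components and the SVM kernel) are illustrative assumptions, not the paper's actual configuration; the intent is only to show how per-emotion GMM log-likelihoods can summarize frame-level MFCCs and be concatenated with functional and lexical features before SVM training.

```python
# Hedged sketch of the acoustic + lexical fusion pipeline (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)
EMOTIONS = ["angry", "happy", "sad", "neutral"]  # four-way task

# --- Synthetic stand-ins for real features --------------------------------
n_utts = 200
# Variable-length 13-dim MFCC frame sequences, one array per utterance.
mfcc_frames = [rng.normal(size=(rng.integers(50, 150), 13)) for _ in range(n_utts)]
functionals = rng.normal(size=(n_utts, 32))       # e.g., stats of F0, intensity
lexical = rng.integers(0, 2, size=(n_utts, 50))   # word-stem presence/absence
labels = rng.integers(0, len(EMOTIONS), size=n_utts)

# --- (1) Emotion-dependent GMM scores over segmental MFCCs ----------------
# Fit one GMM per emotion on the frames of that emotion's utterances, then
# use each utterance's average log-likelihood under every GMM as a feature.
# (A real experiment would compute these scores via cross-validation to
# avoid scoring utterances with GMMs trained on them.)
gmms = []
for e in range(len(EMOTIONS)):
    frames = np.vstack([f for f, y in zip(mfcc_frames, labels) if y == e])
    gmms.append(GaussianMixture(n_components=8, random_state=0).fit(frames))

# score() returns the mean per-frame log-likelihood of the utterance.
gmm_scores = np.array([[g.score(f) for g in gmms] for f in mfcc_frames])

# --- (2) Fuse acoustic and lexical features, train the SVM ----------------
X = np.hstack([gmm_scores, functionals, lexical])
clf = SVC(kernel="rbf", C=1.0).fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```

The design point the sketch illustrates is early (feature-level) fusion: the GMM scores compress variable-length frame sequences into a fixed-length vector, so they can be concatenated directly with utterance-level functionals and lexical indicators in a single SVM input.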
