Compensating for speaker or lexical variabilities in speech for emotion recognition

Affect recognition is a crucial requirement for future human-machine interfaces to respond effectively to the nonverbal behaviors of the user. Speech emotion recognition systems analyze acoustic features to deduce the speaker's emotional state. However, the human voice conveys a mixture of information including speaker, lexical, cultural, physiological and emotional traits. The presence of these communication aspects introduces variabilities that affect the performance of an emotion recognition system. Therefore, building robust emotional models requires careful consideration to compensate for these variabilities. This study aims to factorize speaker characteristics, verbal content and expressive behaviors in various acoustic features. The factorization technique consists of building phoneme-level trajectory models for the features. We propose a metric to quantify the dependency between acoustic features and communication traits (i.e., speaker, lexical and emotional factors). This metric, which is motivated by the mutual information framework, estimates the uncertainty reduction in the trajectory models when a given trait is considered. The analysis provides important insights into the dependencies between the features and the aforementioned factors. Motivated by these results, we propose a feature normalization technique based on the whitening transformation that aims to compensate for speaker and lexical variabilities. The benefit of employing this normalization scheme is validated with the presented factor analysis method. The emotion recognition experiments show that the normalization approach can attenuate the variability imposed by the verbal content and speaker identity, yielding 4.1% and 2.4% relative performance improvements on a selected set of features, respectively.
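
The normalization described above relies on the whitening transformation, which removes the first- and second-order statistics (mean and covariance) of the features within a conditioning group such as a speaker or a lexical unit. The sketch below is a minimal illustration of such a per-group whitening scheme, assuming NumPy; the function names, the eigenvalue floor, and the grouping interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def whitening_transform(features):
    """Estimate a whitening transform (zero mean, identity covariance)
    from a matrix of feature vectors with shape (n_frames, n_dims)."""
    mu = features.mean(axis=0)
    centered = features - mu
    # Covariance of the conditioning group (e.g., one speaker or one phoneme)
    sigma = np.cov(centered, rowvar=False)
    # Symmetric inverse square root via eigendecomposition; a small floor
    # keeps near-singular directions from exploding
    eigvals, eigvecs = np.linalg.eigh(sigma)
    eigvals = np.maximum(eigvals, 1e-8)
    w = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return mu, w

def normalize_by_group(features, groups):
    """Whiten each group (speaker identity or lexical unit) separately so
    that group-dependent location and scale structure is removed, leaving
    the residual variation for the emotion model."""
    out = np.empty_like(features, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        mu, w = whitening_transform(features[idx])
        out[idx] = (features[idx] - mu) @ w.T
    return out
```

In this hypothetical setup, `groups` would hold a speaker label per frame to attenuate speaker variability, or a phoneme/word label to attenuate lexical variability; the same mechanics apply to either factor.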
