Leveraging inter-rater agreement for audio-visual emotion recognition

Human emotional expressions are often ambiguous and unclear, resulting in disagreement or confusion among human evaluators. In this paper, we investigate how audio-visual emotion recognition systems can leverage prototypicality, the level of agreement or confusion among human evaluators. We propose the use of a weighted Support Vector Machine to explicitly model the relationship between the prototypicality of training instances and the evaluated emotion labels in the IEMOCAP corpus. We select the weights of prototypical and non-prototypical instances based on the maximal accuracy achieved for each speaker. We then provide a per-speaker analysis to identify the specific speech characteristics associated with the information gain of emotion given prototypicality information. Our experimental results show that Neutral, one of the most challenging emotions to recognize, obtains the highest performance gain from prototypicality information compared to the other emotion classes: Angry, Happy, and Sad. We also show that the proposed method significantly improves overall multi-class classification accuracy over traditional methods that do not leverage prototypicality.
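
As a concrete illustration of the weighting mechanism, the sketch below trains an SVM in which each instance's influence is scaled by whether human evaluators agreed on its label. This is a minimal sketch, not the paper's implementation: it substitutes scikit-learn's SVC (a LIBSVM wrapper) for the paper's setup, uses toy random data in place of IEMOCAP features, and the weight values `w_proto` and `w_nonproto` are hypothetical placeholders rather than the per-speaker tuned values described above.

```python
# Minimal sketch: weighting SVM training instances by prototypicality.
# Assumptions (not from the paper): scikit-learn's SVC stands in for the
# paper's LIBSVM setup, features X are random stand-ins for audio-visual
# features, and the weights below are illustrative placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: 200 utterances, 50-dim features, 4 emotion classes
# (0=Angry, 1=Happy, 2=Neutral, 3=Sad), plus a flag marking whether
# evaluators agreed on the label (prototypical) or not.
X = rng.normal(size=(200, 50))
y = rng.integers(0, 4, size=200)
is_prototypical = rng.random(200) < 0.6

# Hypothetical weights: emphasize instances evaluators agreed on.
# In the paper, these are chosen per speaker to maximize accuracy.
w_proto, w_nonproto = 1.0, 0.5
sample_weight = np.where(is_prototypical, w_proto, w_nonproto)

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y, sample_weight=sample_weight)  # per-instance weighting
print(clf.predict(X[:5]))
```

In scikit-learn, `sample_weight` rescales the misclassification penalty C for each instance, so down-weighting non-prototypical (low-agreement) examples lets the decision boundary be driven primarily by the instances evaluators agreed on.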

[1] Björn Schuller et al. Selecting Training Data for Cross-Corpus Speech Emotion Recognition: Prototypicality vs. Generalization, 2011.

[2] Chih-Jen Lin et al. LIBSVM: A library for support vector machines, 2011, TIST.

[3] Honglak Lee et al. Sparse deep belief net model for visual area V2, 2007, NIPS.

[4] Emily Mower Provost et al. Emotion classification via utterance-level dynamics: A pattern-based approach to characterizing affective expressions, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Carlos Busso et al. IEMOCAP: interactive emotional dyadic motion capture database, 2008, Lang. Resour. Evaluation.

[6] Tim Polzehl et al. Emotion detection in dialog systems: Applications, strategies and challenges, 2009, 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII).

[7] Björn W. Schuller et al. A multitask approach to continuous five-dimensional affect sensing in natural speech, 2012, TIIS.

[8] Carlos Busso et al. Using neutral speech models for emotional speech analysis, 2007, INTERSPEECH.

[9] Honglak Lee et al. Deep learning for robust feature generation in audiovisual emotion recognition, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] Carlos Busso et al. Interpreting ambiguous emotional expressions, 2009, 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII).

[11] Björn W. Schuller et al. Paralinguistics in speech and language: State-of-the-art and the challenge, 2013, Comput. Speech Lang.

[12] Thomas G. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, 1998, Neural Computation.

[13] Emily Mower Provost et al. EmoShapelets: Capturing local dynamics of audio-visual affective speech, 2015, International Conference on Affective Computing and Intelligent Interaction (ACII).

[14] Emily Mower Provost et al. Say Cheese vs. Smile: Reducing Speech-Related Variability for Facial Emotion Recognition, 2014, ACM Multimedia.

[15] Yoshua Bengio. Learning Deep Architectures for AI, 2007, Found. Trends Mach. Learn.

[16] Maja J. Mataric et al. A Framework for Automatic Human Emotion Classification Using Emotion Profiles, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[17] E. Nöth et al. Releasing a thoroughly annotated and processed spontaneous emotional database: the FAU Aibo Emotion Corpus, 2008.

[18] Carlos Busso et al. Visual emotion recognition using compact facial representations and viseme information, 2010, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Carlos Busso et al. Feature and model level compensation of lexical content for facial emotion recognition, 2013, 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[20] Carlos Busso et al. Emotion recognition using a hierarchical binary decision tree approach, 2011, Speech Commun.

[21] Juhan Nam et al. Multimodal Deep Learning, 2011, ICML.

[22] Yee Whye Teh et al. A Fast Learning Algorithm for Deep Belief Nets, 2006, Neural Computation.

[23] Yue Wang et al. Weighted support vector machine for data classification, 2005, Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN).