Word-Level Emotion Recognition Using High-Level Features

In this paper, we investigate the use of high-level features for recognizing human emotions at the word level in natural conversations with virtual agents. Experiments were carried out on the 2012 Audio/Visual Emotion Challenge (AVEC 2012) database, where emotions are defined as vectors in the Arousal-Expectancy-Power-Valence emotional space. Our model, built on six novel disfluency features, yields significant improvements over models that use a large number of low-level spectral and prosodic features, and its overall performance does not differ significantly from that of the best model in the AVEC 2012 Word-Level Sub-Challenge. Our visual model, which uses Active Shape Model features, likewise yields significant improvements over models based on low-level Local Binary Pattern features. We built a bimodal model by combining our disfluency and visual feature sets and applying Correlation-based Feature-subset Selection. Considering overall performance across all emotion dimensions, our bimodal model outperforms the second-best model of the challenge and comes close to the best; it also gives the best result for predicting Expectancy values.
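The bimodal pipeline outlined above, fusing word-level disfluency and Active Shape Model feature sets and then applying Correlation-based Feature-subset Selection (CFS) before per-dimension regression, can be summarized in a minimal sketch. The CFS merit function follows Hall's standard formulation; the support-vector regressor, feature counts, and synthetic data below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumptions, not the paper's exact pipeline): fuse a
# 6-dim disfluency set with ASM visual features, run greedy CFS, and fit
# one support-vector regressor per emotion dimension (A, E, P, V).
import numpy as np
from sklearn.svm import SVR

def cfs_merit(X, y, subset):
    """Hall's CFS merit: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward(X, y, max_features=10):
    """Greedy forward search that adds features while CFS merit improves."""
    selected, remaining, best = [], list(range(X.shape[1])), -np.inf
    while remaining and len(selected) < max_features:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:
            break                       # merit no longer improves: stop
        best = merit
        selected.append(j)
        remaining.remove(j)
    return selected

# Synthetic stand-ins for word-level features and A-E-P-V labels.
rng = np.random.default_rng(0)
n_words = 200
X = np.hstack([rng.normal(size=(n_words, 6)),    # disfluency features
               rng.normal(size=(n_words, 20))])  # ASM visual features
Y = rng.normal(size=(n_words, 4))                # Arousal, Expectancy, Power, Valence

models = {}
for d, dim in enumerate(["Arousal", "Expectancy", "Power", "Valence"]):
    feats = cfs_forward(X, Y[:, d])              # per-dimension CFS subset
    models[dim] = (feats, SVR(kernel="rbf").fit(X[:, feats], Y[:, d]))
```

In practice CFS is often run once on the fused training features (for example, via WEKA's CfsSubsetEval); selecting a separate subset per emotion dimension, as done here, is one reasonable variant.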
