Recognizing emotions in dialogues with acoustic and lexical features

Automatic emotion recognition has long been a focus of Affective Computing. We aim to improve the performance of state-of-the-art emotion recognition in dialogues using novel knowledge-inspired features and modality fusion strategies. We propose features based on disfluencies and non-verbal vocalisations (DIS-NVs), and show that they are highly predictive of emotion in spontaneous dialogues. We also propose a hierarchical fusion strategy as an alternative to the standard feature-level and decision-level fusion approaches. This strategy introduces the features of different modalities at different layers of a hierarchical model, and is expected to overcome the limitations of feature-level and decision-level fusion by incorporating knowledge of modality differences while preserving the information carried by each modality.
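As a rough illustration of the hierarchical fusion idea, below is a minimal sketch using stacked LSTMs in Keras: the lexical DIS-NV stream enters at the first layer and the acoustic stream is merged in at the second, so each modality joins the hierarchy at a different depth. The feature dimensions, layer sizes, modality ordering, and the four-class output are all illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of hierarchical modality fusion with stacked LSTMs (Keras).
# All dimensions below are assumptions for illustration only.
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

T = 20             # sequence length (e.g. words or frames per utterance); assumption
DIM_DISNV = 5      # DIS-NV feature dimension; assumption
DIM_ACOUSTIC = 88  # acoustic feature dimension (e.g. an openSMILE set); assumption
N_CLASSES = 4      # number of emotion classes; assumption

# Layer 1 models the lexical DIS-NV stream on its own.
disnv_in = Input(shape=(T, DIM_DISNV), name="disnv")
h1 = LSTM(32, return_sequences=True)(disnv_in)

# Layer 2 merges the first layer's representation with the acoustic stream,
# so the two modalities are fused at different depths rather than at the
# input (feature-level) or at the output (decision-level).
acoustic_in = Input(shape=(T, DIM_ACOUSTIC), name="acoustic")
merged = Concatenate()([h1, acoustic_in])
h2 = LSTM(32)(merged)

out = Dense(N_CLASSES, activation="softmax")(h2)
model = Model([disnv_in, acoustic_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Forward pass on dummy data to check shapes.
x_disnv = np.random.rand(8, T, DIM_DISNV).astype("float32")
x_acoustic = np.random.rand(8, T, DIM_ACOUSTIC).astype("float32")
print(model.predict([x_disnv, x_acoustic]).shape)  # (8, 4)
```

For contrast: feature-level fusion would concatenate both streams at the input of a single model, while decision-level fusion would train one model per modality and combine their predictions. The hierarchical variant sits between the two, and the choice of which modality enters at which layer is a design decision; the ordering above is one possible arrangement, not necessarily the paper's.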
