Exploring the Contextual Factors Affecting Multimodal Emotion Recognition in Videos

Emotional expressions form a key part of user behavior on today's digital platforms. While multimodal emotion recognition techniques are gaining research attention, we still lack a deeper understanding of why visual and non-visual features improve emotion recognition in certain contexts but not in others. This study analyzes how multimodal emotion features derived from facial expressions, tone, and text interact with two key contextual factors: i) the gender of the speaker, and ii) the duration of the emotional episode. Using a large public dataset of 2,176 manually annotated YouTube videos, we found that although multimodal features consistently outperformed bimodal and unimodal features, their performance varied significantly across emotion, gender, and duration contexts. Multimodal features performed notably better for male speakers in recognizing most emotions. Furthermore, multimodal features performed notably better for shorter than for longer videos in recognizing neutrality and happiness, but not sadness and anger. These findings offer new insights toward the development of more context-aware emotion recognition and empathetic systems.
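The unimodal/bimodal/multimodal comparison described above can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual pipeline: the feature arrays below are random placeholders standing in for precomputed face, tone, and text features, and the classifier choice (scikit-learn's RandomForestClassifier), the early feature-level fusion, and the train/test protocol are all assumptions made for demonstration.

```python
# Minimal sketch: compare unimodal, bimodal, and trimodal feature fusion
# for emotion classification, with a per-context (gender) breakdown.
# All features and labels are random placeholders (assumption).
import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2176  # matches the dataset size described above

# Hypothetical precomputed per-video features for each modality.
features = {
    "face": rng.normal(size=(n, 64)),
    "tone": rng.normal(size=(n, 32)),
    "text": rng.normal(size=(n, 100)),
}
labels = rng.integers(0, 4, size=n)  # e.g. neutral / happy / sad / angry
gender = rng.integers(0, 2, size=n)  # contextual factor i); duration would be handled analogously

# Enumerate every modality subset: three unimodal, three bimodal, one trimodal.
modalities = list(features)
subsets = [c for r in range(1, len(modalities) + 1)
           for c in combinations(modalities, r)]

for subset in subsets:
    # Early (feature-level) fusion: concatenate the selected modality features.
    X = np.hstack([features[m] for m in subset])
    X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
        X, labels, gender, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    overall = f1_score(y_te, pred, average="macro")
    # Per-context breakdown: macro-F1 within each gender subgroup.
    per_gender = [f1_score(y_te[g_te == g], pred[g_te == g], average="macro")
                  for g in (0, 1)]
    print(f"{'+'.join(subset):14s} overall={overall:.3f} "
          f"g0={per_gender[0]:.3f} g1={per_gender[1]:.3f}")
```

With real features in place of the random arrays, the same subset loop surfaces the kind of contrast the abstract reports: whether the full trimodal combination outperforms bimodal and unimodal subsets, and whether that gain differs by speaker gender (and, analogously, by episode duration).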
