AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention

In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and on fusion techniques that combine the extracted features for predicting affect. Yet, these approaches typically ignore not only the correlation between inputs from multiple modalities, but also the correlation between temporal inputs (i.e., sequential features). To explore these correlations, we propose a neural network architecture, AttendAffectNet (AAN), that uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. In particular, visual, audio, and text features are used to predict emotions expressed in terms of valence and arousal. We analyze three variants of the proposed AAN: the Feature AAN, the Temporal AAN, and the Mixed AAN. The Feature AAN applies self-attention in a novel way to the features extracted from the different modalities of a whole movie (video, audio, and movie subtitles), thereby capturing the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account: self-attention is applied to the concatenated (multimodal) feature vectors representing subsequent movie segments. The Mixed AAN combines the strengths of the Feature AAN and the Temporal AAN by applying self-attention first to the feature vectors obtained from the different modalities of each movie segment, and then to the resulting representations of all subsequent (temporal) movie segments. We extensively trained and validated the proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset. Our experiments show that audio features play a more influential role than video and subtitle features when predicting the emotions of movie viewers on these datasets. Models that use visual, audio, and text features simultaneously as input performed better than those using features from each modality separately. In addition, the Feature AAN outperformed the other AAN variants on the above-mentioned datasets, highlighting the importance of treating different features as context for one another when fusing them. The Feature AAN also performed better than the baseline models when predicting the valence dimension.
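To make the Feature AAN idea more concrete, the sketch below shows one possible way to apply self-attention across per-modality feature vectors in PyTorch: each modality's features are projected to a common dimension, treated as a token, passed through a Transformer encoder so the modalities attend to one another, and then pooled and regressed to valence and arousal. This is a simplified illustration, not the authors' implementation; the class name `FeatureAAN`, the feature dimensions, the hyperparameters, and the mean-pooling choice are assumptions made for the example.

```python
# Minimal sketch of the Feature AAN idea (assumed dimensions and hyperparameters,
# not the authors' released code).
import torch
import torch.nn as nn

class FeatureAAN(nn.Module):
    def __init__(self, modality_dims=(2048, 1582, 768), d_model=256,
                 nhead=4, num_layers=2):
        super().__init__()
        # One linear projection per modality maps its features to a shared d_model.
        self.projections = nn.ModuleList(
            nn.Linear(dim, d_model) for dim in modality_dims
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Regression head outputs two values: valence and arousal.
        self.head = nn.Linear(d_model, 2)

    def forward(self, features):
        # features: list of tensors, one per modality, each of shape (batch, dim_m)
        tokens = torch.stack(
            [proj(f) for proj, f in zip(self.projections, features)], dim=1
        )  # (batch, num_modalities, d_model)
        attended = self.encoder(tokens)   # self-attention across modality tokens
        pooled = attended.mean(dim=1)     # average-pool over modalities
        return self.head(pooled)          # (batch, 2): [valence, arousal]

# Example usage with random features for a batch of 4 movie inputs
# (the feature extractors and their output sizes are illustrative assumptions).
model = FeatureAAN()
visual = torch.randn(4, 2048)   # e.g., CNN-based visual features
audio = torch.randn(4, 1582)    # e.g., openSMILE-style audio features
text = torch.randn(4, 768)      # e.g., BERT-style subtitle embeddings
valence_arousal = model([visual, audio, text])
print(valence_arousal.shape)    # torch.Size([4, 2])
```

Under this framing, the Temporal AAN variant would instead treat each movie segment's concatenated multimodal vector as a token and attend over the segment sequence, and the Mixed AAN would stack the two stages.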
