A Bayesian nonparametric multimodal data modeling framework for video emotion recognition

Video emotion recognition is an emerging research field that has attracted growing attention in recent years. The task is challenging: human emotions are hard to differentiate precisely because of their complexity and diversity, and expressions of sentiment in a content-rich video are sparse. Previous studies have proposed a number of approaches to learning human emotions at the video level by exploiting various video features. However, most of this work relied on simple low-level features such as hand-crafted image descriptors and did not consider the latent connections among the different multimodal data within a video. To tackle these problems, we develop a novel Bayesian nonparametric multimodal data modeling framework for learning emotions from video, in which the image data are deep features extracted from key frames via convolutional neural networks (CNNs) and the audio data are Mel-frequency cepstral coefficient (MFCC) features. Within this framework, a symmetric correspondence hierarchical Dirichlet processes (Sym-cHDP) model mines the latent emotional events (topics) shared between the image and audio features. Finally, the effectiveness of the framework is demonstrated through comprehensive experiments.
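To make the feature-extraction stage concrete, below is a minimal Python sketch. The abstract does not name a specific CNN backbone or toolchain, so the torchvision ResNet-50, the librosa MFCC routine, and all paths and parameter values here are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of the feature-extraction stage (illustrative; the paper
# does not specify these tools). Assumptions: key frames are already
# extracted to image files, a torchvision ResNet-50 stands in for the
# unspecified CNN, and librosa computes the MFCCs.
import torch
import librosa
from PIL import Image
from torchvision import models, transforms

# Pretrained CNN with the classifier head removed, used as a feature extractor.
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()  # yields a 2048-d vector per key frame
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frame_paths):
    """Deep CNN features for a video's key frames, one row per frame."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                         for p in frame_paths])
    return cnn(batch).numpy()

def mfcc_features(audio_path, n_mfcc=13):
    """MFCC features for a video's audio track, one row per analysis frame."""
    signal, sr = librosa.load(audio_path, sr=None)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T
```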
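Sym-cHDP itself is a research model with no off-the-shelf implementation. As an explicitly swapped-in stand-in for the topic-mining step, the sketch below quantizes the continuous features into discrete "words" with k-means and fits a standard HDP per modality using gensim's HdpModel; it illustrates the general HDP workflow (an unbounded, inferred number of topics), not the symmetric correspondence coupling of the actual model.

```python
# Stand-in for the topic-mining step: a plain HDP on vector-quantized
# features. This is NOT Sym-cHDP; it only shows how HDP-style models
# infer an unbounded set of latent topics from bag-of-words data.
from gensim.corpora import Dictionary
from gensim.models import HdpModel
from sklearn.cluster import KMeans

def quantize(features, n_words=200):
    """Map each continuous feature vector to a discrete 'word' index."""
    km = KMeans(n_clusters=n_words, n_init=10).fit(features)
    return km.labels_

def fit_hdp(videos_as_word_lists):
    """videos_as_word_lists: one list of word indices per video."""
    texts = [[f"w{w}" for w in doc] for doc in videos_as_word_lists]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    return HdpModel(corpus, id2word=dictionary)  # topic count is inferred
```

A full Sym-cHDP implementation would additionally couple the topic assignments across modalities, so that the image words and audio words of the same video draw from corresponding latent emotional events, as the abstract describes.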
