Deep learning vs. kernel methods: Performance for emotion prediction in videos

Recently, mainly due to the advances of deep learning, the performances in scene and object recognition have been progressing intensively. On the other hand, more subjective recognition tasks, such as emotion prediction, stagnate at moderate levels. In such context, is it possible to make affective computational models benefit from the breakthroughs in deep learning? This paper proposes to introduce the strength of deep learning in the context of emotion prediction in videos. The two main contributions are as follow: (i) a new dataset, composed of 30 movies under Creative Commons licenses, continuously annotated along the induced valence and arousal axes (publicly available) is introduced, for which (ii) the performance of the Convolutional Neural Networks (CNN) through supervised fine-tuning, the Support Vector Machines for Regression (SVR) and the combination of both (Transfer Learning) are computed and discussed. To the best of our knowledge, it is the first approach in the literature using CNNs to predict dimensional affective scores from videos. The experimental results show that the limited size of the dataset prevents the learning or finetuning of CNN-based frameworks but that transfer learning is a promising solution to improve the performance of affective movie content analysis frameworks as long as very large datasets annotated along affective dimensions are not available.

[1]  Kristian Kroschel,et al.  Audio-visual emotion recognition using an emotion space concept , 2008, 2008 16th European Signal Processing Conference.

[2]  J. R. Landis,et al.  The measurement of observer agreement for categorical data. , 1977, Biometrics.

[3]  Justus J. Randolph Free-Marginal Multirater Kappa (multirater K[free]): An Alternative to Fleiss' Fixed-Marginal Multirater Kappa. , 2005 .

[4]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[5]  Carlos Busso,et al.  Analysis and Compensation of the Reaction Lag of Evaluators in Continuous Emotional Annotations , 2013, 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction.

[6]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[7]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[8]  Riccardo Leonardi,et al.  Affective Recommendation of Movies Based on Selected Connotative Features , 2013, IEEE Transactions on Circuits and Systems for Video Technology.

[9]  Yi-Hsuan Yang,et al.  1000 songs for emotional analysis of music , 2013, CrowdMM '13.

[10]  K. Scherer,et al.  On the Acoustics of Emotion in Audio: What Speech, Music, and Sound have in Common , 2013, Front. Psychol..

[11]  Razvan Pascanu,et al.  Combining modality specific deep neural networks for emotion recognition in video , 2013, ICMI '13.

[12]  Shiliang Zhang,et al.  Affective Visualization and Retrieval for Music Video , 2010, IEEE Transactions on Multimedia.

[13]  Angeliki Metallinou,et al.  Annotation and processing of continuous emotional attributes: Challenges and opportunities , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[14]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[15]  Shiliang Zhang,et al.  Utilizing affective analysis for efficient movie browsing , 2009, 2009 16th IEEE International Conference on Image Processing (ICIP).

[16]  Roddy Cowie,et al.  Gtrace: General Trace Program Compatible with EmotionML , 2013, 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction.

[17]  Emmanuel Dellandréa,et al.  LIRIS-ACCEDE: A Video Database for Affective Content Analysis , 2015, IEEE Transactions on Affective Computing.

[18]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[19]  Alan Hanjalic,et al.  Affective video content representation and modeling , 2005, IEEE Transactions on Multimedia.

[20]  Masakazu Matsugu,et al.  Subject independent facial expression recognition with robust face detection using a convolutional neural network , 2003, Neural Networks.

[21]  M. Bradley,et al.  Measuring emotion: the Self-Assessment Manikin and the Semantic Differential. , 1994, Journal of behavior therapy and experimental psychiatry.

[22]  Hatice Gunes,et al.  Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities in Valence-Arousal Space , 2011, IEEE Transactions on Affective Computing.

[23]  Athanasia Zlatintsi,et al.  A supervised approach to movie emotion tracking , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  K. Scherer Appraisal considered as a process of multilevel sequential checking. , 2001 .

[25]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Alexander J. Smola,et al.  Support Vector Regression Machines , 1996, NIPS.