A multimodal mixture-of-experts model for dynamic emotion prediction in movies

This paper addresses the problem of continuous emotion prediction in movies from multimodal cues. The emotional content of movies is inherently multimodal: emotion is evoked through both the audio (music, speech) and video modalities. To capture this affective information, we put forth a set of audio and video features that includes several novel features, such as Video Compressibility and Histogram of Facial Area (HFA). We propose a Mixture of Experts (MoE)-based fusion model that dynamically combines information from the audio and video modalities to predict the emotion evoked by movies. A learning module based on the hard Expectation-Maximization (EM) algorithm is presented for the MoE model. Experiments on a database of popular movies demonstrate that our MoE-based fusion method outperforms common fusion strategies (e.g., early and late fusion) for dynamic emotion prediction.
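The abstract does not spell out the model details, so the sketch below is only a rough illustration of one common way to realize MoE fusion trained with hard EM: linear regressors as the audio and video experts, a logistic-regression gate over the concatenated features, and squared prediction error driving the hard assignments. All of these choices, along with the function names, are assumptions made for illustration, not the paper's exact formulation.

```python
# Minimal sketch of MoE audio-video fusion trained with hard EM.
# Linear experts, logistic gate, and squared-error assignments are
# illustrative assumptions, not the paper's exact model.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def fit_moe_hard_em(X_audio, X_video, y, n_iters=20):
    """Fit one expert per modality plus a gating function with hard EM."""
    X = [X_audio, X_video]            # one feature matrix per modality
    X_gate = np.hstack(X)             # the gate sees both modalities
    experts = [LinearRegression() for _ in range(2)]
    for k in range(2):
        experts[k].fit(X[k], y)       # initialize each expert on all data
    z = np.zeros(len(y), dtype=int)
    for _ in range(n_iters):
        # Hard E-step: assign each sample to its best-predicting expert.
        errs = np.stack([(e.predict(X[k]) - y) ** 2
                         for k, e in enumerate(experts)], axis=1)
        z_new = errs.argmin(axis=1)
        if np.array_equal(z_new, z):
            break
        z = z_new
        # M-step: refit each expert on the samples assigned to it.
        for k in range(2):
            idx = z == k
            if idx.sum() > 1:         # skip degenerate assignments
                experts[k].fit(X[k][idx], y[idx])
    # Train the gate to reproduce the final hard assignments
    # (assumes each expert wins on at least some samples).
    gate = LogisticRegression(max_iter=1000).fit(X_gate, z)
    return experts, gate

def predict_moe(experts, gate, X_audio, X_video):
    """Soft-weight the expert predictions by the gate's posteriors."""
    X = [X_audio, X_video]
    w = gate.predict_proba(np.hstack(X))                      # shape (n, 2)
    preds = np.stack([e.predict(X[k]) for k, e in enumerate(experts)],
                     axis=1)
    return (w * preds).sum(axis=1)
```

At test time the gate's posterior weights the two expert predictions per sample, which is what allows the fusion to adapt dynamically as the dominant modality shifts between scenes.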
