Multimodal Continuous Prediction of Emotions in Movies using Long Short-Term Memory Networks

Predicting the emotions that movies are designed to evoke can be useful in entertainment applications such as content personalization, video summarization, and ad placement. Movies build their emotional content primarily through audio and video, and because this content develops over time, the temporal context of these modalities is an important aspect of modeling it. In this paper, we use Long Short-Term Memory networks (LSTMs) to model the temporal context in audio-video features of movies. We present continuous emotion prediction results using a multimodal fusion scheme on an annotated dataset of Academy Award winning movies. We report a significant improvement over state-of-the-art results: the correlation between predicted and annotated values improves from 0.62 to 0.84 for arousal, and from 0.29 to 0.50 for valence.
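
To make the approach concrete, below is a minimal sketch of continuous valence/arousal regression with an LSTM over fused audio-video features. This is not the authors' implementation: the feature dimensions, hidden size, window length, and early-fusion-by-concatenation design are illustrative assumptions, and the paper's actual fusion scheme and feature extraction may differ.

```python
# Minimal sketch (assumed, not the paper's code): an LSTM that maps a
# sequence of fused per-window audio and video features to one continuous
# emotion value (e.g. arousal) per time step.
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    def __init__(self, audio_dim=64, video_dim=128, hidden_dim=100):
        super().__init__()
        # Early fusion: concatenate audio and video features per window.
        self.lstm = nn.LSTM(audio_dim + video_dim, hidden_dim,
                            batch_first=True)
        # One regression output per time step; a second output unit
        # (or a separate model) could predict valence.
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, audio_feats, video_feats):
        x = torch.cat([audio_feats, video_feats], dim=-1)  # (B, T, A+V)
        out, _ = self.lstm(x)                              # (B, T, H)
        return self.head(out).squeeze(-1)                  # (B, T)

# Toy usage: one movie clip represented as 300 one-second windows.
model = EmotionLSTM()
audio = torch.randn(1, 300, 64)   # assumed audio feature dimension
video = torch.randn(1, 300, 128)  # assumed video feature dimension
arousal_curve = model(audio, video)  # continuous prediction per window
```

In this sketch the LSTM carries state across windows, which is what lets the prediction at each time step depend on how the audio-visual content has evolved, rather than on the current window alone.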
