Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning

Techniques for violent scene detection and affective impact prediction in videos can be deployed in many applications. In MediaEval 2015, we explore deep learning methods to tackle this challenging problem. Our system combines several deep learning features. First, we train a Convolutional Neural Network (CNN) model on a subset of ImageNet classes selected particularly for violence detection. Second, we adopt a specially designed two-stream CNN framework [1] to extract features from both static frames and motion optical flows. Third, Long Short-Term Memory (LSTM) models are applied on top of the two-stream CNN features to capture longer-term temporal dynamics. In addition, several conventional motion and audio features are extracted as complementary information to the deep learning features. By fusing all of these features, we achieve a mean average precision of 0.296 in the violence detection subtask, and accuracies of 0.418 and 0.488 for arousal and valence, respectively, in the induced affect detection subtask.
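The final step above fuses scores from multiple feature streams. As a minimal sketch of how such late fusion can work, the snippet below averages per-segment scores from several hypothetical feature streams with per-stream weights; the feature names, weights, and scores are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of weighted late fusion over per-segment scores.
# All names, weights, and score values below are illustrative.

def fuse_scores(scores_by_feature, weights):
    """Weighted average of per-segment scores across feature streams.

    scores_by_feature: {feature_name: [score per video segment]}
    weights: {feature_name: relative weight}
    Returns a list of fused scores, one per segment.
    """
    n = len(next(iter(scores_by_feature.values())))
    total_w = sum(weights[name] for name in scores_by_feature)
    fused = [0.0] * n
    for name, scores in scores_by_feature.items():
        w = weights[name] / total_w  # normalize weights to sum to 1
        for i, s in enumerate(scores):
            fused[i] += w * s
    return fused

# Three hypothetical feature streams scoring four video segments.
scores = {
    "cnn_static": [0.9, 0.2, 0.6, 0.1],   # appearance stream
    "cnn_motion": [0.8, 0.3, 0.7, 0.2],   # optical-flow stream
    "audio_mfcc": [0.6, 0.1, 0.5, 0.4],   # audio feature
}
weights = {"cnn_static": 1.0, "cnn_motion": 1.0, "audio_mfcc": 0.5}
fused = fuse_scores(scores, weights)
```

In practice the per-stream weights would be tuned on a validation set, and the per-segment scores would come from classifiers trained on each feature; this sketch only shows the fusion arithmetic itself.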