Violent Interaction Detection in Video Based on Deep Learning

Violent interaction detection is of vital importance in some video surveillance scenarios like railway stations, prisons or psychiatric centres. Existing vision-based methods are mainly based on hand-crafted features such as statistic features between motion regions, leading to a poor adaptability to another dataset. En lightened by the development of convolutional networks on common activity recognition, we construct a FightNet to represent the complicated visual violence interaction. In this paper, a new input modality, image acceleration field is proposed to better extract the motion attributes. Firstly, each video is framed as RGB images. Secondly, optical flow field is computed using the consecutive frames and acceleration field is obtained according to the optical flow field. Thirdly, the FightNet is trained with three kinds of input modalities, i.e., RGB images for spatial networks, optical flow images and acceleration images for temporal networks. By fusing results from different inputs, we conclude whether a video tells a violent event or not. To provide researchers a common ground for comparison, we have collected a violent interaction dataset (VID), containing 2314 videos with 1077 fight ones and 1237 no-fight ones. By comparison with other algorithms, experimental results demonstrate that the proposed model for violent interaction detection shows higher accuracy and better robustness.

[1]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Terumasa Aoki,et al.  Violent Scenes Detection Using Mid-Level Violence Clustering , 2014 .

[3]  Oscar Déniz-Suárez,et al.  Fast violence detection in video , 2015, 2014 International Conference on Computer Vision Theory and Applications (VISAPP).

[4]  Li-Yun Wang,et al.  Violence Detection in Movies , 2011, 2011 Eighth International Conference Computer Graphics, Imaging and Visualization.

[5]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[6]  Stephen Chi-fai Chan,et al.  Automatic Fight Detection Based on Motion Analysis , 2015, 2015 IEEE International Symposium on Multimedia (ISM).

[7]  Vu Lam,et al.  Evaluation of multiple features for violent scenes detection , 2017, Multimedia Tools and Applications.

[8]  Bowen Zhang,et al.  Real-Time Action Recognition with Enhanced Motion Vector CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[10]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[11]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[12]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[13]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[14]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Rahul Sukthankar,et al.  Violence Detection in Video Using Computer Vision Techniques , 2011, CAIP.

[17]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[18]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[19]  Tae-Kyun Kim,et al.  Fast Fight Detection , 2015, PloS one.

[20]  Georges Quénot,et al.  Joint Audio-Visual Words for Violent Scenes Detection in Movies , 2014, ICMR.

[21]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[22]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[23]  Xiangjian He,et al.  A new method for violence detection in surveillance scenes , 2015, Multimedia Tools and Applications.