Multimodal Violence Detection in Videos

Effective tools for detecting violence are in high demand, especially for video streams. Such tools have a wide range of applications, from forensics and law enforcement to parental control over the ever-increasing amount of video available online. Prior studies have shown that deep learning has great potential for detecting violence, but they focus either on violence in general or on only specific cases of violent behavior. While the concept of violence is broad and highly subjective, simpler concepts such as fights, explosions, and gunshots convey the idea of violence while being more objective. Although these concepts all relate to the same broader idea of violence, they differ widely in whether they involve movement, depend on the presence of a specific object, or generate distinctive sounds. In this study, we propose to analyze different concepts related to violence, and how to better describe them by exploring visual and auditory cues, in order to arrive at a robust method for detecting violence.
