Automatic Detection of Violence in Video Scenes

Violence detection in video streams using machine learning is a rapidly growing research area, because automatically detecting violent acts early and alerting those responsible for intervening can contribute to public security and save lives. To address this problem, a convolutional neural network (CNN) was designed and implemented, along with two classical classifiers, namely a support vector machine and a random forest. The CNN is used both as a classifier and as a feature extractor whose outputs are fed into the other two classifiers. A data structure called a "packet" was developed to serve as the input to our CNN. Packets are used to train the model on short clips: each packet consists of 15 sampled frames that together make up one second of video, so the input to the CNN is a subsampled 3D volume of video frames. The problem is posed as binary classification. Four datasets are used. One is a combination of YouTube videos collected and annotated by the authors and a dataset found on Kaggle that was also filtered by the authors; the other three are benchmark datasets. Our model was trained using supervised learning on a set of normal and violent videos and tested using three separate classifiers, whose results are compared with one another and then with other state-of-the-art approaches. Finally, transfer learning was evaluated through cross-dataset validation, in which models that were trained on one dataset were tested on another.
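To make the packet structure concrete, the following is a minimal sketch of how one-second packets of 15 sampled frames could be built from a decoded video. It is not the authors' implementation. The function name make_packets, the fps argument, the (T, H, W, C) array layout, and the choice of evenly spaced sampling within each second are all assumptions made for illustration; the text above states only that a packet holds 15 sampled frames covering one second.

```python
import numpy as np


def make_packets(frames, fps, frames_per_packet=15):
    """Split a video into one-second packets of sampled frames.

    frames: array of shape (T, H, W, C) holding the decoded video.
    fps: integer frame rate of the source video.
    Returns an array of shape (N, frames_per_packet, H, W, C), where N is
    the number of complete seconds in the video.
    """
    frames = np.asarray(frames)
    n_seconds = frames.shape[0] // fps  # a trailing partial second is dropped

    # Assumption: frames are sampled at evenly spaced offsets within each
    # second. If fps < frames_per_packet, some frames are repeated.
    offsets = np.linspace(0, fps - 1, frames_per_packet).round().astype(int)

    packets = [frames[s * fps + offsets] for s in range(n_seconds)]
    if not packets:
        empty_shape = (0, frames_per_packet) + frames.shape[1:]
        return np.empty(empty_shape, dtype=frames.dtype)
    return np.stack(packets)


# Usage: a synthetic 3.2-second clip at 30 fps with 64x64 RGB frames.
video = np.zeros((96, 64, 64, 3), dtype=np.uint8)
packets = make_packets(video, fps=30)
print(packets.shape)
```

Each packet in this sketch is a small 3D volume of frames (time, height, width, plus channels), which matches the description of the CNN input as a subsampled volume of video frames. In a binary classification setting, each packet would carry a single normal or violent label.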