Spatio-temporal elastic cuboid trajectories for efficient fight recognition using Hough forests

While action recognition has become an important line of research in computer vision, the recognition of particular events such as aggressive behavior, or fights, has received comparatively little attention. Such recognition could be extremely useful in video surveillance scenarios such as psychiatric centers and prisons, or even on camera-equipped smartphones. This potential usefulness has caused a surge of interest in developing fight or violence detectors. The key requirement in this setting is efficiency: these methods must be computationally very fast. In this paper, spatio-temporal elastic cuboid trajectories are proposed for fight recognition. The method uses blob movements to create trajectories that capture and model the motions that are specific to a fight. The proposed method is robust to the specific shapes and positions of the individuals. Additionally, the standard Hough forests classifier is adapted for use with this descriptor. The method is compared with nine other related methods on four datasets. The results show that the proposed method obtains the best accuracy on each dataset and is also computationally efficient.
