MoWLD: a robust motion image descriptor for violence detection

Automatic violence detection from video is an active research topic with many video surveillance applications. However, there has been little success in designing an algorithm that can detect violence in surveillance videos with high performance. Existing methods typically apply the Bag-of-Words (BoW) model to local spatiotemporal descriptors. However, traditional spatiotemporal features are not discriminative enough, and the BoW model coarsely assigns each feature vector to only one visual word and therefore ignores the spatial relationships among the features. To tackle these problems, we propose a novel Motion Weber Local Descriptor (MoWLD), in the spirit of the well-known Weber Local Descriptor (WLD), as a powerful and robust descriptor for motion images. We extend the spatial description of WLD by adding a temporal component to the appearance descriptor, which implicitly captures local motion information as well as low-level image appearance information. To eliminate redundant and irrelevant features, non-parametric Kernel Density Estimation (KDE) is applied to the MoWLD descriptors. To obtain more discriminative features, we adopt a sparse coding and max pooling scheme to further process the selected MoWLDs. Experimental results on three benchmark datasets demonstrate the superiority of the proposed approach over state-of-the-art methods.
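The contrast between BoW hard assignment and the sparse coding plus max pooling scheme can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the ISTA solver, the penalty `lam`, and the toy two-word codebook are all assumptions chosen for illustration, and the descriptors here are plain 2-D vectors rather than MoWLDs.

```python
import math


def hard_assign(x, codebook):
    """BoW-style coding: one-hot vector marking the nearest visual word."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, word)) for word in codebook]
    code = [0.0] * len(codebook)
    code[dists.index(min(dists))] = 1.0
    return code


def soft_threshold(v, t):
    """Shrink v toward zero by t (proximal operator of the L1 norm)."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def sparse_code(x, codebook, lam=0.1, n_iter=200):
    """ISTA for min_a 0.5*||x - sum_k a_k*d_k||^2 + lam*||a||_1 (assumed solver)."""
    n_words, dim = len(codebook), len(x)
    # 1/||D||_F^2 is a safe step size: it never exceeds 1/L for this objective.
    step = 1.0 / sum(c * c for word in codebook for c in word)
    a = [0.0] * n_words
    for _ in range(n_iter):
        recon = [sum(a[k] * codebook[k][j] for k in range(n_words)) for j in range(dim)]
        resid = [recon[j] - x[j] for j in range(dim)]
        grad = [sum(word[j] * resid[j] for j in range(dim)) for word in codebook]
        a = [soft_threshold(a[k] - step * grad[k], step * lam) for k in range(n_words)]
    return a


def max_pool(codes):
    """Per visual word, keep the largest absolute response over all descriptors."""
    return [max(abs(code[k]) for code in codes) for k in range(len(codes[0]))]


# Toy example: two orthonormal visual words and two descriptors from one clip.
codebook = [[1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.6, 0.5], [0.2, 1.0]]

bow_codes = [hard_assign(x, codebook) for x in descriptors]
sparse_codes = [sparse_code(x, codebook) for x in descriptors]
clip_feature = max_pool(sparse_codes)
```

In this toy setting the first descriptor lies almost equally close to both words; hard assignment keeps only one of them, while the sparse code retains a graded response to each, and max pooling summarizes the clip by the strongest response per word.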
