Timed-image based deep learning for action recognition in video sequences

Abstract The paper addresses two issues related to machine learning on 2D + X data volumes, where 2D refers to the image observation and X denotes a variable that can be associated with time, depth, wavelength, etc. The first issue is conditioning these structured volumes so that they are compatible with convolutional neural networks operating on 2D image file formats. The second issue is sensitive action detection in the “2D + Time” case (video clips and image time series). For the data conditioning issue, the paper first highlights that relating 2D spatial convolution to its 1D Hilbert-based counterpart is highly accurate with respect to information compressibility over tight frames of convolutional networks. Based on this compressibility, the paper proposes converting the 2D + X data volume into a single meta-image file format before it is passed to machine learning frameworks. In this conversion, every 2D frame of the 2D + X data is reshaped as a 1D array indexed by a Hilbert space-filling curve, and the third variable X of the initial file format becomes the second variable of the meta-image format. For the sensitive action recognition issue, the paper provides: (i) a 3-category video database involving non-violent, moderately violent and extremely violent actions; (ii) the conversion of this database into a timed meta-image database by means of the 2D + Time to 2D conditioning stage described above; and (iii) outstanding 2-level and 3-level violence classification results from deep convolutional neural networks operating on meta-image databases.
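The conversion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`hilbert_d2xy`, `hilbert_order`, `volume_to_meta_image`), the use of NumPy, and the restriction to square frames whose side is a power of two are all assumptions made for illustration. Each frame is read along a Hilbert space-filling curve into a 1D array, and the X (here, time) axis becomes the second axis of the resulting meta-image.

```python
import numpy as np


def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid.

    n is assumed to be a power of two. This is the standard iterative
    index-to-coordinate conversion for the Hilbert curve.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect the current sub-square.
                x = s - 1 - x
                y = s - 1 - y
            # Transpose the current sub-square.
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Return (rows, cols) index arrays visiting an n x n frame in Hilbert order."""
    coords = [hilbert_d2xy(n, d) for d in range(n * n)]
    xs, ys = zip(*coords)
    # Assumption: x is the column index and y is the row index.
    return np.array(ys), np.array(xs)


def volume_to_meta_image(volume):
    """Convert a (T, N, N) volume into an (N*N, T) meta-image.

    Row k of the result holds the pixel at Hilbert index k across all T
    frames, so the third variable of the volume becomes the second axis.
    """
    t, h, w = volume.shape
    if h != w or h < 1 or h & (h - 1):
        raise ValueError("frames must be square with a power-of-two side")
    rows, cols = hilbert_order(h)
    # Advanced indexing yields shape (T, N*N); transpose to (N*N, T).
    return volume[:, rows, cols].T
```

For example, a clip of 16 frames of size 64 x 64 stored as an array of shape `(16, 64, 64)` would yield a meta-image of shape `(4096, 16)`. Because consecutive positions along a Hilbert curve are always neighbouring pixels, adjacent rows of the meta-image remain spatially close in the original frame, which is the locality property the 1D Hilbert-based representation relies on.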
