Histogram of Oriented Gradient-Based Fusion of Features for Human Action Recognition in Action Video Sequences

Human Action Recognition (HAR) is the classification of an action performed by a human. The goal of this study is to recognize human actions in action video sequences. We present a novel feature descriptor for HAR that extracts multiple features and combines them using a fusion technique. The descriptor is designed to exploit the dissimilarities between actions. The key contribution of the proposed approach is a robust feature descriptor that works across different video sequences and classification models. HAR is performed in the following manner. First, the moving object is detected and segmented from the background. Features are then calculated from the segmented moving object using the histogram of oriented gradients (HOG). To reduce the size of the feature descriptor, we average the HOG features across non-overlapping video frames. To capture frequency-domain information, we calculate regional features from the Fourier HOG. We also include the velocity and displacement of the moving object. Finally, a fusion technique combines these features into a single descriptor, which is provided to the classifier. We use well-known classifiers: artificial neural networks (ANNs), support vector machine (SVM), multiple kernel learning (MKL), Meta-cognitive Neural Network (McNN), and late fusion methods. The main objective is to build a robust feature descriptor and to show its versatility: the descriptor performs relatively well across all five classifiers.
The proposed approach is evaluated and compared with state-of-the-art methods for action recognition on two publicly available benchmark datasets (KTH and Weizmann), and with cross-validation on the UCF11, HMDB51, and UCF101 datasets. Results of control experiments, such as changing the SVM classifier and adding a second hidden layer to the ANN, are also reported. The results demonstrate that the proposed method performs reasonably well compared with the majority of existing state-of-the-art methods, including convolutional neural network-based feature extractors.
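
The feature steps described in the abstract (averaging per-frame HOG features over non-overlapping groups of frames, adding displacement and velocity, and fusing the result) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that per-frame HOG vectors and object centroids are already available (random and synthetic values stand in for them here), that displacement and velocity are derived from centroid positions, and that fusion is plain concatenation; the abstract specifies none of these details. The function names, `block_size`, and `fps` are hypothetical.

```python
import numpy as np


def average_over_blocks(per_frame_features, block_size):
    """Average per-frame feature vectors over non-overlapping blocks of frames."""
    feats = np.asarray(per_frame_features, dtype=float)
    n_blocks = feats.shape[0] // block_size
    if n_blocks == 0:
        raise ValueError("need at least block_size frames")
    # Assumption: trailing frames that do not fill a whole block are dropped.
    trimmed = feats[: n_blocks * block_size]
    return trimmed.reshape(n_blocks, block_size, -1).mean(axis=1)


def motion_features(centroids, fps):
    """Return [net displacement, mean speed] of the tracked object.

    Assumption: motion is measured from per-frame centroid positions.
    """
    c = np.asarray(centroids, dtype=float)
    step_lengths = np.linalg.norm(np.diff(c, axis=0), axis=1)
    displacement = np.linalg.norm(c[-1] - c[0])
    mean_speed = step_lengths.mean() * fps  # position units per second
    return np.array([displacement, mean_speed])


def fuse(*feature_groups):
    """Assumption: fusion is simple concatenation of flattened feature groups."""
    return np.concatenate([np.ravel(g) for g in feature_groups])


# Synthetic stand-ins: 10 frames, 36-dimensional HOG vector per frame,
# and an object moving 2 units per frame along the x axis.
rng = np.random.default_rng(0)
hog_per_frame = rng.random((10, 36))
centroids = np.column_stack([np.arange(10) * 2.0, np.zeros(10)])

averaged = average_over_blocks(hog_per_frame, block_size=4)
motion = motion_features(centroids, fps=25.0)
descriptor = fuse(averaged, motion)
print(averaged.shape, motion, descriptor.shape)
```

With these inputs, the ten frames collapse to two averaged HOG vectors, so the temporal part of the descriptor shrinks from 10 x 36 to 2 x 36 values before the two motion values are appended.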
