Improving Human Action Recognition through Hierarchical Neural Network Classifiers

Automatic video understanding is one of the hard problems in machine learning and computer vision. An important area of video analysis is human action recognition (HAR). Although a large number of HAR systems have already been developed, many daily-life actions remain difficult to recognize for several reasons, such as recording on different devices, poor video quality, and similarities among actions. Advances in deep learning, especially in convolutional neural networks (CNNs), have provided methods that are well suited to image and video recognition. This work implements a CNN-based hierarchical recognition approach to recognize the 20 most difficult-to-recognize actions from the Kinetics dataset. Experimental results show that our method significantly improves recognition quality for these actions.
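The abstract does not spell out how the hierarchy is organized, so the following is only a minimal sketch of one common way to build a hierarchical classifier: a coarse model first assigns a clip to a group of similar actions, and a group-specific model then chooses the action within that group. All names here (`hierarchical_predict`, the `group_*` and action labels, and the toy classifiers) are hypothetical; in a real system each classifier would presumably be a trained CNN rather than a hand-written rule.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A "classifier" is any callable that maps a clip's features to a list of
# class probabilities. Assumption: in practice each would be a trained CNN.
Classifier = Callable[[Sequence[float]], List[float]]


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value (first index wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def hierarchical_predict(
    features: Sequence[float],
    coarse: Classifier,
    group_names: List[str],
    fine: Dict[str, Tuple[Classifier, List[str]]],
) -> Tuple[str, str]:
    """Pick a group with the coarse model, then an action inside that group."""
    group = group_names[argmax(coarse(features))]
    fine_model, actions = fine[group]
    action = actions[argmax(fine_model(features))]
    return group, action


# Toy stand-ins for trained networks (hypothetical, hand-written rules).
def toy_coarse(x: Sequence[float]) -> List[float]:
    return [0.9, 0.1] if x[0] > 0 else [0.2, 0.8]


def toy_fine_a(x: Sequence[float]) -> List[float]:
    return [0.3, 0.7] if x[1] > 0 else [0.6, 0.4]


def toy_fine_b(x: Sequence[float]) -> List[float]:
    return [0.6, 0.4] if x[1] > 0 else [0.1, 0.9]


GROUPS = ["group_a", "group_b"]
FINE = {
    "group_a": (toy_fine_a, ["a1", "a2"]),
    "group_b": (toy_fine_b, ["b1", "b2"]),
}

if __name__ == "__main__":
    print(hierarchical_predict([1.0, 1.0], toy_coarse, GROUPS, FINE))
    print(hierarchical_predict([-1.0, -1.0], toy_coarse, GROUPS, FINE))
```

The appeal of this two-stage design is that each fine-grained model only has to separate a few mutually similar actions, which is where a single flat classifier tends to make its mistakes. Whether the paper groups actions this way is not stated in the abstract.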
