A Context Knowledge Map Guided Coarse-to-Fine Action Recognition

Human actions involve a wide variety and a large number of categories, which leads to a big challenge in action recognition. However, according to similarities on human body poses, scenes, interactive objects, human actions can be grouped into some semantic groups, i.e. sports, cooking, etc. Therefore, in this paper, we propose a novel approach which recognizes human actions from coarse to fine. Taking full advantage of contributions from high-level semantic contexts, a context knowledge map guided recognition method is designed to realize the coarse-to-fine procedure. In the approach, we define semantic contexts with interactive objects, scenes and body motions in action videos, and build a context knowledge map to automatically define coarse-grained groups. Then fine-grained classifiers are proposed to realize accurate action recognition. The coarse-to-fine procedure narrows action categories in target classifiers, so it is beneficial to improving recognition performance. We evaluate the proposed approach on the CCV, the HMDB-51, and the UCF101 database. Experiments verify its significant effectiveness, on average, improving more than 5% of recognition precisions than current approaches. Compared with the state-of-the-art, it also obtains outstanding performance. The proposed approach achieves higher accuracies of 93.1%, 95.4% and 74.5% in the CCV, the UCF-101 and the HMDB51 database, respectively.

[1]  Silvio Savarese,et al.  Action Recognition by Hierarchical Mid-Level Action Elements , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[3]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[5]  Zi Huang,et al.  Discrete Nonnegative Spectral Clustering , 2017, IEEE Transactions on Knowledge and Data Engineering.

[6]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[7]  Nicu Sebe,et al.  Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Pong C. Yuen,et al.  Reduced Analytic Dependency Modeling: Robust Fusion for Visual Recognition , 2014, International Journal of Computer Vision.

[9]  Fei-Fei Li,et al.  Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  William J. Knottenbelt,et al.  Deep Learning for Domain-Specific Action Recognition in Tennis , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[11]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[12]  Qing Zhang,et al.  A Survey on Human Motion Analysis from Depth Data , 2013, Time-of-Flight and Depth Imaging.

[13]  Philip S. Yu,et al.  Spatiotemporal Pyramid Network for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Behrooz Mahasseni,et al.  Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[16]  James A. Reggia,et al.  Robust human action recognition via long short-term memory , 2013, The 2013 International Joint Conference on Neural Networks (IJCNN).

[17]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[18]  Antonio Fernández-Caballero,et al.  A survey of video datasets for human action and activity recognition , 2013, Comput. Vis. Image Underst..

[19]  Juergen Gall,et al.  Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Xuelong Li,et al.  Detecting Densely Distributed Graph Patterns for Fine-Grained Image Categorization , 2016, IEEE Transactions on Image Processing.

[22]  Greg Mori,et al.  A Hierarchical Deep Temporal Model for Group Activity Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Heng Tao Shen,et al.  Hierarchical Latent Concept Discovery for Video Event Detection , 2017, IEEE Transactions on Image Processing.

[24]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[26]  Bernt Schiele,et al.  The Long-Short Story of Movie Description , 2015, GCPR.

[27]  Xi Wang,et al.  Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification , 2016, ACM Multimedia.

[28]  Yu-Gang Jiang,et al.  Harnessing Object and Scene Semantics for Large-Scale Video Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[30]  Xuelong Li,et al.  From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[31]  Yu Qiao,et al.  RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Zi Huang,et al.  Web Video Event Recognition by Semantic Analysis from Ubiquitous Documents. , 2016, IEEE transactions on image processing : a publication of the IEEE Signal Processing Society.

[33]  Yang Yang,et al.  One-shot learning based pattern transition map for action early recognition , 2018, Signal Process..

[34]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[37]  Xuelong Li,et al.  Describing Video With Attention-Based Bidirectional LSTM , 2019, IEEE Transactions on Cybernetics.

[38]  Weiyao Lin,et al.  Action Recognition with Coarse-to-Fine Deep Feature Integration and Asynchronous Fusion , 2018, AAAI.

[39]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Xi Wang,et al.  Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification , 2015, ACM Multimedia.

[41]  Daniel Roggen,et al.  Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition , 2016, Sensors.

[42]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[43]  Bingbing Ni,et al.  Multiple Granularity Modeling: A Coarse-to-Fine Framework for Fine-grained Action Analysis , 2016, International Journal of Computer Vision.

[44]  Bingbing Ni,et al.  Recurrent Modeling of Interaction Context for Collective Activity Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Dong Liu,et al.  Sample-Specific Late Fusion for Visual Category Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Shih-Fu Chang,et al.  Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Christopher Baldassano,et al.  Human‐Object Interactions Are More than the Sum of Their Parts , 2016, Cerebral cortex.

[51]  Yutaka Satoh,et al.  Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).