Coarse-to-Fine Aggregation for Cross-Granularity Action Recognition

In this paper, we introduce a novel hierarchical aggregation design that captures different levels of temporal granularity in action recognition. Our design principle is coarse-to-fine and achieved using a tree-structured network; as we traverse this network top-down, pooling operations are getting less invariant but timely more resolute and well localized. Learning the combination of operations in this network - which best fits a given ground-truth - is obtained by solving a constrained minimization problem whose solution corresponds to the distribution of weights that capture the contribution of each level (and thereby temporal granularity) in the global hierarchical pooling process. Besides being principled and well grounded, the proposed hierarchical pooling is also video-length agnostic and resilient to misalignments in actions. Extensive experiments conducted on the challenging UCF-101 database corroborate these statements.

[1]  James J. Little,et al.  Simultaneous Tracking and Action Recognition using the PCA-HOG Descriptor , 2006, The 3rd Canadian Conference on Computer and Robot Vision (CRV'06).

[2]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[3]  Yiannis Demiris,et al.  Prediction of intent in robotics and multi-agent systems , 2007, Cognitive Processing.

[4]  Ezzeddine Zagrouba,et al.  Abnormal behavior recognition for intelligent video surveillance systems: A review , 2018, Expert Syst. Appl..

[5]  Wei Liu,et al.  Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Hichem Sahbi,et al.  High Order Stochastic Graphlet Embedding for Graph-Based Pattern Recognition , 2017, ArXiv.

[7]  Hichem Sahbi,et al.  Semi supervised deep kernel design for image annotation , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Silvio Savarese,et al.  Social Scene Understanding: End-to-End Multi-person Action Localization and Collective Activity Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Dong Xu,et al.  Event Recognition in Videos by Learning from Heterogeneous Web Sources , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  F. Fleuret,et al.  Scale-Invariance of Support Vector Machines based on the Triangular Kernel , 2001 .

[11]  Alex Pentland,et al.  Human computing and machine understanding of human behavior: a survey , 2006, ICMI '06.

[12]  Qiang Ji,et al.  Action recognition and localization with spatial and temporal contexts , 2019, Neurocomputing.

[13]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Hichem Sahbi,et al.  Kernel methods and scale invariance using the triangular kernel , 2004 .

[15]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[16]  Mehryar Mohri,et al.  Algorithms for Learning Kernels Based on Centered Alignment , 2012, J. Mach. Learn. Res..

[17]  Gang Yu,et al.  Fast action proposals for human action detection and search , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[19]  Hichem Sahbi,et al.  From coarse to fine skin and face detection , 2000, ACM Multimedia.

[20]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[21]  Cees Snoek,et al.  Pointly-Supervised Action Localization , 2018, International Journal of Computer Vision.

[22]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[23]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Yuxin Peng,et al.  The application of two-level attention models in deep convolutional neural network for fine-grained image classification , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Hongying Meng,et al.  A Human Action Recognition System for Embedded Computer Vision Application , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Huosheng Hu,et al.  Ubiquitous robotics in physical human action recognition: A comparison between dynamic ANNs and GP , 2008, 2008 IEEE International Conference on Robotics and Automation.

[27]  Dong Xu,et al.  Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Deva Ramanan,et al.  Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Wei Liu,et al.  Reconstruction Network for Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Gang Pan,et al.  Deep Attention Network for Egocentric Action Recognition , 2019, IEEE Transactions on Image Processing.

[32]  Hichem Sahbi,et al.  A particular Gaussian mixture model for clustering and its application to image retrieval , 2008, Soft Comput..

[33]  Cordelia Schmid,et al.  PoTion: Pose MoTion Representation for Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[35]  Gabriela Csurka,et al.  Fisher Vectors: Beyond Bag-of-Visual-Words Image Representations , 2010, VISIGRAPP.

[36]  Amit K. Roy-Chowdhury,et al.  Captioning Near-Future Activity Sequences , 2019, ArXiv.

[37]  Fabien Moutarde,et al.  Multi-users online recognition of technical gestures for natural human–robot collaboration in manufacturing , 2019, Auton. Robots.

[38]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[39]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[40]  Hichem Sahbi,et al.  Deep kernel map networks for image annotation , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41]  Hichem Sahbi ImageCLEF annotation with explicit context-aware kernel maps , 2015, International Journal of Multimedia Information Retrieval.

[42]  Hichem Sahbi,et al.  Transductive Kernel Map Learning and Its Application Image Annotation , 2012, BMVC.

[43]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[44]  Changyin Sun,et al.  Supervised class-specific dictionary learning for sparse modeling in action recognition , 2012, Pattern Recognit..

[45]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[46]  Wei Liu,et al.  Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Yanning Zhang,et al.  Going deeper with two-stream ConvNets for action recognition in video surveillance , 2017, Pattern Recognit. Lett..

[48]  Adina Magda Florea,et al.  Human Action Recognition for Social Robots , 2019, 2019 22nd International Conference on Control Systems and Computer Science (CSCS).

[49]  Ethem Alpaydin,et al.  Multiple Kernel Learning Algorithms , 2011, J. Mach. Learn. Res..

[50]  Hichem Sahbi,et al.  Deep Temporal Pyramid Design for Action Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[51]  Xiaogang Wang,et al.  Deeply learned attributes for crowded scene understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[53]  Amjad Rehman,et al.  Hand-crafted and deep convolutional neural network features fusion and selection strategy: An application to intelligent human action recognition , 2020, Appl. Soft Comput..