Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors

In this paper, we build on a concept of self-supervision by taking RGB frames as input to learn to predict both action concepts and auxiliary descriptors e.g., object descriptors. So-called hallucination streams are trained to predict auxiliary cues, simultaneously fed into classification layers, and then hallucinated at the testing stage to aid network. We design and hallucinate two descriptors, one leveraging four popular object detectors applied to training videos, and the other leveraging image- and video-level saliency detectors. The first descriptor encodes the detector- and Image Net-wise class prediction scores, confidence scores, and spatial locations of bounding boxes and frame indexes to capture the spatio-temporal distribution of features per video. Another descriptor encodes spatio-angular gradient distributions of saliency maps and intensity patterns. Inspired by the characteristic function of the probability distribution, we capture four statistical moments on the above intermediate descriptors. As numbers of coefficients in the mean, covariance, coskewness and cokurtotsis grow linearly, quadratically, cubically and quartically w.r.t. the dimension of feature vectors, we describe the covariance matrix by its leading n' eigenvectors (so-called subspace) and we capture skewness/kurtosis rather than costly coskewness/cokurtosis. We obtain state of the art on five popular datasets such as Charades and EPIC-Kitchens.

[1]  C. Schmid,et al.  On the burstiness of visual elements , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Krystian Mikolajczyk,et al.  Soft assignment of visual words as Linear Coordinate Coding and optimisation of its reconstruction error , 2011, 2011 18th IEEE International Conference on Image Processing.

[3]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[4]  Anoop Cherian,et al.  Higher-Order Pooling of CNN Features via Kernel Linearization for Action Recognition , 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[5]  Thomas B. Moeslund,et al.  Selective spatio-temporal interest points , 2012, Comput. Vis. Image Underst..

[6]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[7]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[8]  Michael S. Ryoo,et al.  AssembleNet++: Assembling Modality Representations via Attention Connections , 2020, ECCV.

[9]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[10]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[11]  Anoop Cherian,et al.  Tensor Representations via Kernel Linearization for Action Recognition from 3D Skeletons , 2016, ECCV.

[12]  Michael S. Ryoo,et al.  AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures , 2019, ICLR.

[13]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[14]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[16]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Richard P. Wildes,et al.  Temporal Residual Networks for Dynamic Scene Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Leland McInnes,et al.  UMAP: Uniform Manifold Approximation and Projection , 2018, J. Open Source Softw..

[19]  Thomas Brox,et al.  Universität Des Saarlandes Fachrichtung 6.1 – Mathematik Highly Accurate Optic Flow Computation with Theoretically Justified Warping Highly Accurate Optic Flow Computation with Theoretically Justified Warping , 2022 .

[20]  Can Zhang,et al.  PAN: Persistent Appearance Network with an Efficient Motion Cue for Fast Action Recognition , 2019, ACM Multimedia.

[21]  C. Bregler,et al.  Large displacement optical flow , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[23]  Yali Wang,et al.  PA3D: Pose-Action 3D Machine for Video Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Ali Borji,et al.  Salient Object Detection: A Benchmark , 2015, IEEE Transactions on Image Processing.

[25]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[26]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[28]  Trang Nguyen,et al.  Generalized Max Pooling for Action Recognition , 2015, 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE).

[29]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[31]  Fatih Murat Porikli,et al.  A Deeper Look at Power Normalizations , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Tong Wu,et al.  Action Recognition with Bootstrapping based Long-range Temporal Context Attention , 2019, ACM Multimedia.

[33]  Kilian Q. Weinberger,et al.  Feature hashing for large scale multitask learning , 2009, ICML '09.

[34]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[35]  DarrellTrevor,et al.  Long-Term Recurrent Convolutional Networks for Visual Recognition and Description , 2017 .

[36]  Cordelia Schmid,et al.  DeepFlow: Large Displacement Optical Flow with Deep Matching , 2013, 2013 IEEE International Conference on Computer Vision.

[37]  Lei Wang,et al.  In defense of soft-assignment coding , 2011, 2011 International Conference on Computer Vision.

[38]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[39]  Lei Wang Analysis and Evaluation of Kinect-based Action Recognition Algorithms , 2021, ArXiv.

[40]  Lars Petersson,et al.  Bilinear Attention Networks for Person Retrieval , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[42]  Vijay Vasudevan,et al.  Learning Transferable Architectures for Scalable Image Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43]  Jian Sun,et al.  Saliency Optimization from Robust Background Detection , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Heng Wang,et al.  Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Christian Wolf,et al.  Object Level Visual Reasoning in Videos , 2018, ECCV.

[47]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[48]  Kaiming He,et al.  Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Richard P. Wildes,et al.  A New Large Scale Dynamic Texture Dataset with Application to ConvNet Understanding , 2018, ECCV.

[51]  Anoop Cherian,et al.  Tensor Representations for Action Recognition , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Zhuowen Tu,et al.  Deeply Supervised Salient Object Detection with Short Connections , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Anoop Cherian,et al.  Non-linear Temporal Subspace Representations for Activity Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  Piotr Koniusz,et al.  Power Normalizations in Fine-Grained Image, Few-Shot Image and Graph Classification , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Michael S. Ryoo,et al.  Evolving Space-Time Neural Architectures for Videos , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[57]  Bernt Schiele,et al.  A database for fine grained activity detection of cooking activities , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[58]  K. Mikolajczyk,et al.  Higher-order Occurrence Pooling on Mid- and Low-level Features: Visual Concept Detection , 2013 .

[59]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[60]  Marios Hadjieleftheriou,et al.  Finding frequent items in data streams , 2008, Proc. VLDB Endow..

[61]  Lior Wolf,et al.  Local Trinary Patterns for human action recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[62]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[63]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  Cordelia Schmid,et al.  EpicFlow: Edge-preserving interpolation of correspondences for optical flow , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Jing Zhang,et al.  Deep Unsupervised Saliency Detection: A Multiple Noisy Labeling Perspective , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66]  Jitendra Malik,et al.  Region-Based Convolutional Networks for Accurate Object Detection and Segmentation , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[67]  William T. Freeman,et al.  Orientation Histograms for Hand Gesture Recognition , 1995 .

[68]  Nicu Sebe,et al.  Realtime Video Classification using Dense HOF/HOG , 2014, ICMR.

[69]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[71]  Nanning Zheng,et al.  A Slow-I-Fast-P Architecture for Compressed Video Action Recognition , 2020, ACM Multimedia.

[72]  Huchuan Lu,et al.  Saliency Detection with Recurrent Fully Convolutional Networks , 2016, ECCV.

[73]  Romain Dupont,et al.  A General Dense Image Matching Framework Combining Direct and Feature-Based Costs , 2013, 2013 IEEE International Conference on Computer Vision.

[74]  Dima Damen,et al.  Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[75]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[76]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[77]  HuXiaowei,et al.  Salient Object Detection , 2017 .

[78]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[79]  Jitendra Malik,et al.  Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[80]  Lianqiang Zhou,et al.  Hallucinating Optical Flow Features for Video Classification , 2019, IJCAI.

[81]  Cordelia Schmid,et al.  PoTion: Pose MoTion Representation for Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[82]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[83]  Anoop Cherian,et al.  Learning Discriminative Video Representations Using Adversarial Perturbations , 2018, ECCV.

[84]  Lei Wang,et al.  Loss Switching Fusion with Similarity Search for Video Classification , 2019, 2019 IEEE International Conference on Image Processing (ICIP).

[85]  Hui Wang,et al.  Human Action Recognition Using Multi-Velocity STIPs and Motion Energy Orientation Histogram , 2014, J. Inf. Sci. Eng..

[86]  Ming Shao,et al.  Finding Achilles' Heel: Adversarial Attack on Multi-modal Action Recognition , 2020, ACM Multimedia.

[87]  Krystian Mikolajczyk,et al.  Higher-Order Occurrence Pooling for Bags-of-Words: Visual Concept Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[88]  Cordelia Schmid,et al.  Convolutional Kernel Networks , 2014, NIPS.

[89]  Rasmus Pagh,et al.  Fast and scalable polynomial kernels via explicit feature maps , 2013, KDD.

[90]  Basura Fernando,et al.  Learning End-to-end Video Classification with Rank-Pooling , 2016, ICML.

[91]  Krystian Mikolajczyk,et al.  Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection , 2013, Comput. Vis. Image Underst..

[92]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[93]  Lei Wang,et al.  A Comparative Review of Recent Kinect-Based Action Recognition Algorithms , 2019, IEEE Transactions on Image Processing.

[94]  Du Q. Huynh,et al.  Hallucinating IDT Descriptors and I3D Optical Flow Features for Action Recognition With CNNs , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[95]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[96]  Jing Zhang,et al.  Few-Shot Learning via Saliency-Guided Hallucination of Samples , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[97]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[98]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[99]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[100]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[101]  Tony Jebara,et al.  Probability Product Kernels , 2004, J. Mach. Learn. Res..