Tensor Representations for Action Recognition

Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations that compactly capture such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations, viz. (i) the sequence compatibility kernel (SCK) and (ii) the dynamics compatibility kernel (DCK); the former builds on the spatio-temporal correlations between features, while the latter explicitly models the action dynamics of a sequence. We also explore a generalization of SCK, coined SCK⊕, that operates on subsequences to capture the local-global interplay of correlations and can incorporate multi-modal inputs, e.g., skeleton 3D body-joints and per-frame classifier scores obtained from deep learning models trained on videos. We introduce linearizations of these kernels that lead to compact and fast descriptors. We provide experiments on (i) 3D skeleton action sequences, (ii) fine-grained video sequences, and (iii) standard non-fine-grained videos. As our final representations are tensors that capture higher-order relationships of features, they relate to co-occurrences for robust fine-grained recognition [1, 2]. We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which have long been speculated to perform spectral detection of higher-order occurrences [3, 4], thus detecting fine-grained relationships of features rather than merely counting features in action sequences. We prove that a tensor of order $r$, built from $Z_*$-dimensional features and coupled with EPN, indeed detects whether at least one higher-order occurrence is ‘projected’ into one of its $\binom{Z_*}{r}$ subspaces of dimension $r$ represented by the tensor, thus forming a Tensor Power Normalization metric endowed with $\binom{Z_*}{r}$ such ‘detectors’.
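To make the counting argument in the last sentence concrete, the following is a minimal sketch and not the paper's method. It averages the order-r outer products of a few toy feature vectors and enumerates the "Z∗ choose r" index subsets, each of which spans an r-dimensional coordinate subspace. It does not implement SCK, DCK, or EPN; the function names, the toy data, and the simple non-zero test standing in for "detection" are all assumptions made for illustration.

```python
from itertools import combinations, product
from math import comb


def order_r_tensor(features, r):
    """Average the r-fold outer product of each feature vector.

    The tensor is stored sparsely as {index_tuple: value}.
    """
    z = len(features[0])
    tensor = {}
    for v in features:
        for idx in product(range(z), repeat=r):
            val = 1.0
            for i in idx:
                val *= v[i]
            if val != 0.0:
                tensor[idx] = tensor.get(idx, 0.0) + val / len(features)
    return tensor


def active_subsets(tensor, z, r, eps=1e-12):
    """Return the r-element index subsets whose joint entry is non-zero.

    This crude test is a stand-in for illustration only; it is not EPN.
    """
    return [s for s in combinations(range(z), r)
            if abs(tensor.get(s, 0.0)) > eps]


# Hypothetical toy data: three 4-dimensional non-negative feature vectors.
features = [
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
z_star, r = 4, 3

tensor = order_r_tensor(features, r)
n_subsets = comb(z_star, r)          # number of candidate r-dim subspaces
detected = active_subsets(tensor, z_star, r)

print(n_subsets, detected)
```

With this toy input there are four candidate index triples, and only the triple (0, 1, 2) has a non-zero third-order entry, because only the first two vectors activate three coordinates at once. The third vector contributes to pairwise entries but to no triple, which is the kind of distinction a higher-order tensor can make and a feature count cannot.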

[1] Anoop Cherian, et al. Non-linear Temporal Subspace Representations for Activity Recognition, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2] Jake K. Aggarwal, et al. View invariant human action recognition using histograms of 3D joints, 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[3] Binlong Li, et al. Cross-view activity recognition using Hankelets, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[4] Ahmed M. Elgammal, et al. Tracking People on a Torus, 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] Lei Wang, et al. Few-Shot Object Detection by Second-Order Pooling, 2020, ACCV.

[6] Andrew W. Fitzgibbon, et al. Real-time human pose recognition in parts from single depth images, 2011, CVPR 2011.

[7] Piotr Koniusz, et al. Power Normalizations in Fine-grained Image, Few-shot Image and Graph Classification, 2020, ArXiv.

[8] Tony Jebara, et al. Probability Product Kernels, 2004, J. Mach. Learn. Res.

[9] Fei-Fei Li, et al. Large-Scale Video Classification with Convolutional Neural Networks, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10] Hongguang Zhang, et al. Power Normalizing Second-Order Similarity Network for Few-Shot Learning, 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[11] Subhransu Maji, et al. Second-order Democratic Aggregation, 2018, ECCV.

[12] Xiaodong Yang, et al. Effective 3D action recognition using EigenJoints, 2014, J. Vis. Commun. Image Represent.

[13] Hassen Drira, et al. Coding Kendall's Shape Trajectories for 3D Action Recognition, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14] Alan L. Yuille, et al. An Approach to Pose-Based Action Recognition, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[15] Ying Wu, et al. Mining actionlet ensemble for action recognition with depth cameras, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16] Xin Yu, et al. Recovering Faces From Portraits with Auxiliary Facial Attributes, 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[17] Mohan M. Trivedi, et al. Joint Angles Similarities and HOG2 for Action Recognition, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[18] Cordelia Schmid, et al. P-CNN: Pose-Based CNN Features for Action Recognition, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[19] Rama Chellappa, et al. Locally time-invariant models of human activities using trajectories on the grassmannian, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[20] Mohammed Bennamoun, et al. A New Representation of Skeleton Sequences for 3D Action Recognition, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Marco La Cascia, et al. 3D skeleton-based human action classification: A survey, 2016, Pattern Recognit.

[22] Xin Yu, et al. Identity-Preserving Face Recovery from Stylized Portraits, 2019, International Journal of Computer Vision.

[23] Demetri Terzopoulos, et al. TensorTextures: multilinear image-based rendering, 2004, ACM Trans. Graph.

[24] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, ArXiv.

[25] Luc Van Gool, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, 2016, ECCV.

[26] Jitendra Malik, et al. Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation, 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27] Tamir Hazan, et al. Non-negative tensor factorization with applications to statistics and computer vision, 2005, ICML.

[28] Rama Chellappa, et al. View Invariance for Human Action Recognition, 2005, International Journal of Computer Vision.

[29] Piotr Koniusz, et al. Identity-Preserving Face Recovery from Portraits, 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[30] Bernt Schiele, et al. A database for fine grained activity detection of cooking activities, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[31] Dieter Fox, et al. Object recognition with hierarchical kernel descriptors, 2011, CVPR 2011.

[32] Lei Wang, et al. Beyond Covariance: Feature Representation with Nonlinear Kernel Matrices, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33] Andrew Zisserman, et al. Two-Stream Convolutional Networks for Action Recognition in Videos, 2014, NIPS.

[34] Ramakant Nevatia, et al. Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost, 2006, ECCV.

[35] Nanning Zheng, et al. View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[36] Eli Shechtman, et al. Space-time behavior based correlation, 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[37] Anoop Cherian, et al. Second-order Temporal Pooling for Action Recognition, 2017, International Journal of Computer Vision.

[38] Michael J. Black, et al. Puppet Flow, 2013.

[39] K. Mikolajczyk, et al. Higher-order Occurrence Pooling on Mid- and Low-level Features: Visual Concept Detection, 2013.

[40] Cristian Sminchisescu, et al. The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection, 2013, 2013 IEEE International Conference on Computer Vision.

[41] Ruzena Bajcsy, et al. Sequence of the Most Informative Joints (SMIJ): A new representation for human skeletal action recognition, 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[42] C. Schmid, et al. On the burstiness of visual elements, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[43] Krystian Mikolajczyk, et al. Higher-Order Occurrence Pooling for Bags-of-Words: Visual Concept Detection, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44] Wei Zeng, et al. Learning Long-Term Dependencies for Action Recognition with a Biologically-Inspired Deep Network, 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[45] Wanqing Li, et al. Action recognition based on a bag of 3D points, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[46] Dahua Lin, et al. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition, 2018, AAAI.

[47] Piotr Koniusz, et al. CNN-based Action Recognition and Supervised Domain Adaptation on 3D Body Skeletons via Kernel Feature Maps, 2018, BMVC.

[48] Bernt Schiele, et al. DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model, 2016, ECCV.

[49] Jian-Huang Lai, et al. Deep Bilinear Learning for RGB-D Action Recognition, 2018, ECCV.

[50] Richard Nock, et al. On Modulating the Gradient for Meta-learning, 2020, ECCV.

[51] Anoop Cherian, et al. Tensor Representations via Kernel Linearization for Action Recognition from 3D Skeletons, 2016, ECCV.

[52] Anoop Cherian, et al. Higher-Order Pooling of CNN Features via Kernel Linearization for Action Recognition, 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[53] Mehrtash Tafazzoli Harandi, et al. Bregman Divergences for Infinite Dimensional Covariance Matrices, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[54] Tae-Kyun Kim, et al. Tensor Canonical Correlation Analysis for Action Classification, 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[55] Marwan Torki, et al. Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations, 2013, IJCAI.

[56] Cordelia Schmid, et al. A time series kernel for action recognition, 2011, BMVC.

[57] Varun Ramakrishna, et al. Convolutional Pose Machines, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58] Vittorio Murino, et al. Kernelized covariance for action recognition, 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[59] Hongdong Li, et al. Few-Shot Action Recognition with Permutation-Invariant Attention, 2020, ECCV.

[60] Xin Yu, et al. Face Destylization, 2017, 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[61] Ming Yang, et al. 3D Convolutional Neural Networks for Human Action Recognition, 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[62] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63] Thomas Mensink, et al. Improving the Fisher Kernel for Large-Scale Image Classification, 2010, ECCV.

[64] Ke Sun, et al. Fisher-Bures Adversary Graph Convolutional Networks, 2019, UAI.

[65] Wanqing Li, et al. Beyond Covariance: SICE and Kernel Based Visual Feature Representation, 2020, International Journal of Computer Vision.

[66] Subhransu Maji, et al. Improved Bilinear Pooling with CNNs, 2017, BMVC.

[67] Alberto Del Bimbo, et al. Recognizing Actions from Depth Cameras as Weakly Aligned Multi-part Bag-of-Poses, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[68] Cordelia Schmid, et al. Dense Trajectories and Motion Boundary Descriptors for Action Recognition, 2013, International Journal of Computer Vision.

[69] Thomas Serre, et al. HMDB: A large video database for human motion recognition, 2011, 2011 International Conference on Computer Vision.

[70] Nello Cristianini, et al. Kernel Methods for Pattern Analysis, 2003, ICTAI.

[71] Bingbing Ni, et al. Interaction part mining: A mid-level approach for fine-grained action recognition, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72] Mehrtash Harandi, et al. Adaptive Subspaces for Few-Shot Learning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[73] G. Johansson. Visual perception of biological motion and a model for its analysis, 1973.

[74] Anoop Cherian, et al. Learning Discriminative Video Representations Using Adversarial Perturbations, 2018, ECCV.

[75] Jianhua Li, et al. A Comprehensive Study on Third Order Statistical Features for Image Splicing Detection, 2011, IWDW.

[76] Anoop Cherian, et al. Sparse Coding for Third-Order Super-Symmetric Tensor Descriptors with Application to Texture Recognition, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[77] Fatih Murat Porikli, et al. A Deeper Look at Power Normalizations, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[78] Haiping Lu, et al. A survey of multilinear subspace learning for tensor data, 2011, Pattern Recognit.

[79] Yann LeCun, et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[80] Yi Lin, et al. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN, 2017, 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[81] Yaser Sheikh, et al. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields, 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[82] Amit K. Roy-Chowdhury, et al. Joint Prediction of Activity Labels and Starting Times in Untrimmed Videos, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[83] Toby Sharp, et al. Real-time human pose recognition in parts from single depth images, 2011, CVPR.

[84] Trang Nguyen, et al. Generalized Max Pooling for Action Recognition, 2015, 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE).

[85] Gang Wang, et al. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[86] Тараса Шевченка, et al. Quo vadis?, 2013, Clinical chemistry.

[87] Demetri Terzopoulos, et al. Multilinear Analysis of Image Ensembles: TensorFaces, 2002, ECCV.

[88] Lorenzo Torresani, et al. Learning Spatiotemporal Features with 3D Convolutional Networks, 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[89] Rama Chellappa, et al. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[90] Michael J. Black, et al. Parameterized Modeling and Recognition of Activities, 1999, Comput. Vis. Image Underst.

[91] Trevor Darrell, et al. Long-term recurrent convolutional networks for visual recognition and description, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).