Tensor Representations for Action Recognition

Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations that compactly capture such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations: (i) the sequence compatibility kernel (SCK) and (ii) the dynamics compatibility kernel (DCK). SCK builds on the spatio-temporal correlations between features, whereas DCK explicitly models the action dynamics of a sequence. We also explore a generalization of SCK, coined SCK $\oplus$, that operates on subsequences to capture the local-global interplay of correlations and can incorporate multi-modal inputs, e.g., skeleton 3D body-joints and per-frame classifier scores obtained from deep learning models trained on videos. We introduce a linearization of these kernels that leads to compact and fast descriptors. We provide experiments on (i) 3D skeleton action sequences, (ii) fine-grained video sequences, and (iii) standard non-fine-grained videos. As our final representations are tensors that capture higher-order relationships of features, they relate to co-occurrences for robust fine-grained recognition (Lin, 2017), (Koniusz, 2018). We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which have long been speculated to perform spectral detection of higher-order occurrences (Koniusz, 2013), (Koniusz, 2017), thus detecting fine-grained relationships of features rather than merely counting features in action sequences. We prove that a tensor of order $r$, built from $Z_*$ dimensional features, coupled with EPN indeed detects if at least one higher-order occurrence is ‘projected’ into one of its $\binom{Z_*}{r}$ subspaces of dimension $r$ represented by the tensor, thus forming a Tensor Power Normalization metric endowed with $\binom{Z_*}{r}$ such ‘detectors’.
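To make the idea of a higher-order tensor coupled with EPN more concrete, the following is a minimal NumPy sketch for order $r = 3$. It is an illustration under stated assumptions, not the authors' implementation: it assumes that EPN can be realized by applying a sign-preserving power to the core tensor of a higher-order SVD (HOSVD), and the function names (`third_order_tensor`, `epn_hosvd`, `num_detectors`), the exponent `gamma`, and the random input features are all hypothetical choices made for this example.

```python
import numpy as np
from math import comb


def third_order_tensor(features):
    """Average the third-order outer products of the rows of `features` (N x Z)."""
    n = features.shape[0]
    return np.einsum("ni,nj,nk->ijk", features, features, features) / n


def unfold(tensor, mode):
    """Mode-`mode` unfolding: move that axis to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)


def epn_hosvd(tensor, gamma=0.5):
    """Assumed EPN variant: power-normalize the HOSVD core, then reassemble."""
    # One orthogonal factor per mode, from the SVD of each unfolding.
    factors = []
    for mode in range(tensor.ndim):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)
    # Core tensor: project every mode onto its factor.
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    # Sign-preserving power on the core entries (gamma is a free parameter).
    core = np.sign(core) * np.abs(core) ** gamma
    # Map the normalized core back to the original feature space.
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out


def num_detectors(z, r):
    """Number of r-dimensional subspaces spanned by r of z coordinates."""
    return comb(z, r)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((10, 4))  # 10 hypothetical frames, Z = 4
    tensor = third_order_tensor(feats)
    normalized = epn_hosvd(tensor, gamma=0.5)
    print(normalized.shape, num_detectors(4, 3))
```

With `gamma=1.0` the sketch returns the input tensor unchanged (up to floating-point error), which is a quick sanity check that the HOSVD factors are orthogonal. For $Z_* = 4$ and $r = 3$, `num_detectors` gives $\binom{4}{3} = 4$ subspaces.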

[1]  Piotr Koniusz,et al.  Power Normalizations in Fine-Grained Image, Few-Shot Image and Graph Classification , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Piotr Koniusz,et al.  REFINE: Random RangE FInder for Network Embedding , 2021, CIKM.

[3]  Piotr Koniusz,et al.  Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors , 2020, ACM Multimedia.

[4]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Piotr Koniusz,et al.  Simple Spectral Graph Convolution , 2021, ICLR.

[6]  Wanqing Li,et al.  Beyond Covariance: SICE and Kernel Based Visual Feature Representation , 2020, International Journal of Computer Vision.

[7]  Mehrtash Harandi,et al.  Adaptive Subspaces for Few-Shot Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Hongdong Li,et al.  Few-Shot Action Recognition with Permutation-Invariant Attention , 2020, ECCV.

[9]  Lei Wang,et al.  Few-Shot Object Detection by Second-Order Pooling , 2020, ACCV.

[10]  Richard Nock,et al.  On Modulating the Gradient for Meta-learning , 2020, ECCV.

[11]  Du Q. Huynh,et al.  Hallucinating IDT Descriptors and I3D Optical Flow Features for Action Recognition With CNNs , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Xin Yu,et al.  Identity-Preserving Face Recovery from Stylized Portraits , 2019, International Journal of Computer Vision.

[13]  Ke Sun,et al.  Fisher-Bures Adversary Graph Convolutional Networks , 2019, UAI.

[14]  Xin Yu,et al.  Recovering Faces From Portraits with Auxiliary Facial Attributes , 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[15]  Hongguang Zhang,et al.  Power Normalizing Second-Order Similarity Network for Few-Shot Learning , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[16]  Jian-Huang Lai,et al.  Deep Bilinear Learning for RGB-D Action Recognition , 2018, ECCV.

[17]  Subhransu Maji,et al.  Second-order Democratic Aggregation , 2018, ECCV.

[18]  Anoop Cherian,et al.  Second-order Temporal Pooling for Action Recognition , 2017, International Journal of Computer Vision.

[19]  Anoop Cherian,et al.  Learning Discriminative Video Representations Using Adversarial Perturbations , 2018, ECCV.

[20]  Piotr Koniusz,et al.  CNN-based Action Recognition and Supervised Domain Adaptation on 3D Body Skeletons via Kernel Feature Maps , 2018, BMVC.

[21]  Fatih Murat Porikli,et al.  A Deeper Look at Power Normalizations , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Hassen Drira,et al.  Coding Kendall's Shape Trajectories for 3D Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Anoop Cherian,et al.  Non-linear Temporal Subspace Representations for Activity Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[25]  Piotr Koniusz,et al.  Identity-Preserving Face Recovery from Portraits , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[26]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Xin Yu,et al.  Face Destylization , 2017, 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[28]  Amit K. Roy-Chowdhury,et al.  Joint Prediction of Activity Labels and Starting Times in Untrimmed Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Subhransu Maji,et al.  Improved Bilinear Pooling with CNNs , 2017, BMVC.

[30]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[32]  Yi Lin,et al.  Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN , 2017, 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[33]  Nanning Zheng,et al.  View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Mohammed Bennamoun,et al.  A New Representation of Skeleton Sequences for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Krystian Mikolajczyk,et al.  Higher-Order Occurrence Pooling for Bags-of-Words: Visual Concept Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Anoop Cherian,et al.  Higher-Order Pooling of CNN Features via Kernel Linearization for Action Recognition , 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[37]  Wei Zeng,et al.  Learning Long-Term Dependencies for Action Recognition with a Biologically-Inspired Deep Network , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[40]  Anoop Cherian,et al.  Sparse Coding for Third-Order Super-Symmetric Tensor Descriptors with Application to Texture Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Bernt Schiele,et al.  DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[42]  Marco La Cascia,et al.  3D skeleton-based human action classification: A survey , 2016, Pattern Recognit.

[43]  Vittorio Murino,et al.  Kernelized covariance for action recognition , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[44]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Anoop Cherian,et al.  Tensor Representations via Kernel Linearization for Action Recognition from 3D Skeletons , 2016, ECCV.

[46]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Lei Wang,et al.  Beyond Covariance: Feature Representation with Nonlinear Kernel Matrices , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[48]  Trang Nguyen,et al.  Generalized Max Pooling for Action Recognition , 2015, 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE).

[49]  Cordelia Schmid,et al.  P-CNN: Pose-Based CNN Features for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[50]  Bingbing Ni,et al.  Interaction part mining: A mid-level approach for fine-grained action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[52]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[55]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[56]  Mehrtash Tafazzoli Harandi,et al.  Bregman Divergences for Infinite Dimensional Covariance Matrices , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[57]  Xiaodong Yang,et al.  Effective 3D action recognition using EigenJoints , 2014, J. Vis. Commun. Image Represent.

[58]  Cristian Sminchisescu,et al.  The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[59]  K. Mikolajczyk,et al.  Higher-order Occurrence Pooling on Mid- and Low-level Features: Visual Concept Detection , 2013 .

[60]  Marwan Torki,et al.  Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations , 2013, IJCAI.

[61]  Alberto Del Bimbo,et al.  Recognizing Actions from Depth Cameras as Weakly Aligned Multi-part Bag-of-Poses , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[62]  Mohan M. Trivedi,et al.  Joint Angles Similarities and HOG2 for Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[63]  Alan L. Yuille,et al.  An Approach to Pose-Based Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[65]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[67]  Michael J. Black,et al.  Puppet Flow , 2013 .

[68]  Jake K. Aggarwal,et al.  View invariant human action recognition using histograms of 3D joints , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[69]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[70]  Binlong Li,et al.  Cross-view activity recognition using Hankelets , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[71]  Ruzena Bajcsy,et al.  Sequence of the Most Informative Joints (SMIJ): A new representation for human skeletal action recognition , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[72]  Bernt Schiele,et al.  A database for fine grained activity detection of cooking activities , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[73]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[74]  Jianhua Li,et al.  A Comprehensive Study on Third Order Statistical Features for Image Splicing Detection , 2011, IWDW.

[75]  Cordelia Schmid,et al.  A time series kernel for action recognition , 2011, BMVC.

[76]  Haiping Lu,et al.  A survey of multilinear subspace learning for tensor data , 2011, Pattern Recognit.

[77]  Dieter Fox,et al.  Object recognition with hierarchical kernel descriptors , 2011, CVPR 2011.

[78]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[79]  Jitendra Malik,et al.  Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[80]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[81]  Wanqing Li,et al.  Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[82]  C. Schmid,et al.  On the burstiness of visual elements , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[83]  Rama Chellappa,et al.  Locally time-invariant models of human activities using trajectories on the grassmannian , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[84]  Ahmed M. Elgammal,et al.  Tracking People on a Torus , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[85]  Tae-Kyun Kim,et al.  Tensor Canonical Correlation Analysis for Action Classification , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[86]  Ramakant Nevatia,et al.  Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost , 2006, ECCV.

[87]  Tamir Hazan,et al.  Non-negative tensor factorization with applications to statistics and computer vision , 2005, ICML.

[88]  Eli Shechtman,et al.  Space-time behavior based correlation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[89]  Rama Chellappa,et al.  View Invariance for Human Action Recognition , 2005, International Journal of Computer Vision.

[90]  Tony Jebara,et al.  Probability Product Kernels , 2004, J. Mach. Learn. Res.

[91]  Demetri Terzopoulos,et al.  TensorTextures: multilinear image-based rendering , 2004, ACM Trans. Graph.

[92]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2004 .

[93]  Demetri Terzopoulos,et al.  Multilinear Analysis of Image Ensembles: TensorFaces , 2002, ECCV.

[94]  Michael J. Black,et al.  Parameterized modeling and recognition of activities , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[95]  G. Johansson Visual perception of biological motion and a model for its analysis , 1973 .