Tangent Fisher Vector on Matrix Manifolds for Action Recognition

In this paper, we address the problem of representing and recognizing human actions from videos on matrix manifolds. For this purpose, we propose a new vector representation method, named tangent Fisher vector, to describe video sequences in the Fisher kernel framework. We first extract dense curved spatio-temporal cuboids from each video sequence. Compared with the traditional ’straight cuboids’, the dense curved spatio-temporal cuboids contain much more local motion information. Each cuboid is then described using a linear dynamical system (LDS) to simultaneously capture the local appearance and dynamics. Furthermore, a simple yet efficient algorithm is proposed to learn the LDS parameters and approximate the observability matrix at the same time. Each video sequence is thus represented by a set of LDSs. Considering that each LDS can be viewed as a point in a Grassmann manifold, we propose to learn an intrinsic GMM on the manifold to cluster the LDS points. Finally a tangent Fisher vector is computed by first accumulating all the tangent vectors in each Gaussian component, and then concatenating the normalized results across all the Gaussian components. A kernel is defined to measure the similarity between tangent Fisher vectors for classification and recognition of a video sequence. This approach is evaluated on the state-of-the-art human action benchmark datasets. The recognition performance is competitive when compared with current state-of-the-art results.

[1]  Fei Xiong,et al.  Low order dynamics embedding for high dimensional time series , 2011, 2011 International Conference on Computer Vision.

[2]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[3]  Binlong Li,et al.  Cross-view activity recognition using Hankelets , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[5]  Stephen J. Maybank,et al.  Learning Human Actions by Combining Global Dynamics and Local Appearance , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Rama Chellappa,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 Matching Shape Sequences in Video with Applications in Human Movement Analysis. Ieee Transactions on Pattern Analysis and Machine Intelligence 2 , 2022 .

[8]  Vladimir Pavlovic,et al.  Impact of dynamic model learning on classification of human motion , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[9]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Bart De Moor,et al.  N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems , 1994, Autom..

[11]  Limin Wang,et al.  Temporal Segment Networks for Action Recognition in Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Xiao Liu,et al.  Multimodal Keyless Attention Fusion for Video Classification , 2018, AAAI.

[13]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[14]  Janusz Konrad,et al.  Action Recognition Using Sparse Representation on Covariance Manifolds of Optical Flow , 2010, 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance.

[15]  Stefano Soatto,et al.  Classification and Recognition of Dynamical Models: The Role of Phase, Independent Components, Kernels and Optimal Transport , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  M. Moonen,et al.  On- and off-line identification of linear state-space models , 1989 .

[17]  Christoph Bregler,et al.  Learning and recognizing human dynamics in video sequences , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Rama Chellappa,et al.  Unsupervised view and rate invariant clustering of video sequences q , 2009 .

[19]  Gregory D. Hager,et al.  Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions , 2009, CVPR.

[20]  Cordelia Schmid,et al.  MARS: Motion-Augmented RGB Stream for Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Cordelia Schmid,et al.  A Robust and Efficient Video Representation for Action Recognition , 2015, International Journal of Computer Vision.

[22]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[23]  Rama Chellappa,et al.  From Videos to Verbs: Mining Videos for Activities using a Cascade of Dynamical Systems , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Rama Chellappa,et al.  Statistical Computations on Grassmann and Stiefel Manifolds for Image and Video-Based Recognition , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Gaurav Sharma,et al.  AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yui Man Lui,et al.  Tangent Bundles on Special Manifolds for Action Recognition , 2012, IEEE Transactions on Circuits and Systems for Video Technology.

[27]  Binlong Li,et al.  Activity recognition using dynamic subspace angles , 2011, CVPR 2011.

[28]  Richard J. Martin A metric for ARMA processes , 2000, IEEE Trans. Signal Process..

[29]  Bart De Moor,et al.  Subspace angles between ARMA models , 2002, Syst. Control. Lett..

[30]  Rama Chellappa,et al.  Rate-Invariant Recognition of Humans and Their Activities , 2009, IEEE Transactions on Image Processing.

[31]  Yi Zhu,et al.  Hidden Two-Stream Convolutional Networks for Action Recognition , 2017, ACCV.

[32]  Xavier Pennec,et al.  Statistical Computing on Manifolds: From Riemannian Geometry to Computational Anatomy , 2009, ETVC.

[33]  Stefano Soatto,et al.  Dynamic Textures , 2003, International Journal of Computer Vision.

[34]  René Vidal,et al.  Group action induced distances for averaging and clustering Linear Dynamical Systems with applications to the analysis of dynamic scenes , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Andrew Zisserman,et al.  All About VLAD , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[37]  Hassan Foroosh,et al.  A Temporal Sequence Learning for Action Recognition and Prediction , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[38]  Bhiksha Raj,et al.  Beyond Gaussian Pyramid: Multi-skip Feature Stacking for action recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .

[40]  Yue Gao,et al.  Inductive Multi-Hypergraph Learning and Its Application on View-Based 3D Object Classification , 2018, IEEE Transactions on Image Processing.

[41]  Abhinav Gupta,et al.  ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[43]  Alexander J. Smola,et al.  Binet-Cauchy Kernels on Dynamical Systems and its Application to the Analysis of Dynamic Scenes , 2007, International Journal of Computer Vision.

[44]  René Vidal,et al.  Categorizing Dynamic Textures Using a Bag of Dynamical Systems , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Geoffrey E. Hinton,et al.  Parameter estimation for linear dynamical systems , 1996 .

[46]  Weiming Hu,et al.  Learning silhouette dynamics for human action recognition , 2013, 2013 IEEE International Conference on Image Processing.

[47]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[48]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[49]  Dacheng Tao,et al.  Slow Feature Analysis for Human Action Recognition , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  P. Absil,et al.  Riemannian Geometry of Grassmann Manifolds with a View on Algorithmic Computation , 2004 .