Spatio-temporal covariance descriptors for action and gesture recognition

We propose a new action and gesture recognition method based on spatio-temporal covariance descriptors and a weighted Riemannian locality preserving projection approach that takes into account the curved space formed by the descriptors. The weighted projection is then exploited during boosting to create a final multiclass classification algorithm that employs the most useful spatio-temporal regions. We also show how the descriptors can be computed quickly through the use of integral video representations. Experiments on the UCF sport, CK+ facial expression and Cambridge hand gesture datasets indicate superior performance of the proposed method compared to several recent state-of-the-art techniques. The proposed method is robust and does not require additional processing of the videos, such as foreground detection, interest-point detection or tracking.

[1]  Y. Freund,et al.  Discussion of the Paper \additive Logistic Regression: a Statistical View of Boosting" By , 2000 .

[2]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[3]  Mikhail Belkin,et al.  Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , 2003, Neural Computation.

[4]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[5]  Martial Hebert,et al.  Efficient visual event detection using volumetric features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[6]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[7]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories , 2006 .

[8]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[9]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Václav Hlavác,et al.  Pose primitive based human action recognition in videos or still images , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Fatih Murat Porikli,et al.  Pedestrian Detection via Classification on Riemannian Manifolds , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Matti Pietikäinen,et al.  Human Activity Recognition Using a Dynamic Texture Based Method , 2008, BMVC.

[14]  Hongbin Zha,et al.  Riemannian Manifold Learning , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Rama Chellappa,et al.  Machine Recognition of Human Activities: A Survey , 2008, IEEE Transactions on Circuits and Systems for Video Technology.

[16]  Tae-Kyun Kim,et al.  Canonical Correlation Analysis of Video Volume Tensors for Action Categorization and Detection , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[18]  J. Ross Beveridge,et al.  Action classification on product manifolds , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[19]  Adriana Kovashka,et al.  Learning a hierarchy of discriminative space-time neighborhood features for human action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[20]  Mubarak Shah,et al.  Human Action Recognition in Videos Using Kinematic Features and Multiple Instance Learning , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Takeo Kanade,et al.  The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[22]  Janusz Konrad,et al.  Action Recognition Using Sparse Representation on Covariance Manifolds of Optical Flow , 2010, 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance.

[23]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[24]  Dong Xu,et al.  Action recognition using context and appearance distribution features , 2011, CVPR 2011.

[25]  Jake K. Aggarwal,et al.  Facial expression recognition with temporal modeling of shapes , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[26]  Brian C. Lovell,et al.  Improved anomaly detection in crowded scenes via cell-based analysis of foreground speed, size and texture , 2011, CVPR 2011 WORKSHOPS.

[27]  Sridha Sridharan,et al.  Person-independent facial expression detection using Constrained Local Models , 2011, Face and Gesture 2011.

[28]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[29]  Brian C. Lovell,et al.  K-tangent spaces on Riemannian manifolds for improved pedestrian detection , 2014, 2012 19th IEEE International Conference on Image Processing.

[30]  Brian C. Lovell,et al.  Kernel analysis over Riemannian manifolds for visual recognition of actions, pedestrians and textures , 2012, 2012 IEEE Workshop on the Applications of Computer Vision (WACV).

[31]  Yui Man Lui,et al.  Tangent Bundles on Special Manifolds for Action Recognition , 2012, IEEE Transactions on Circuits and Systems for Video Technology.

[32]  Brian C. Lovell,et al.  Sparse Coding and Dictionary Learning for Symmetric Positive Definite Matrices: A Kernel Approach , 2012, ECCV.