Action recognition using global spatio-temporal features derived from sparse representations

Abstract Recognizing actions is one of the important challenges in computer vision with respect to video data, with applications to surveillance, diagnostics of mental disorders, and video retrieval. Compared to other data modalities such as documents and images, processing video data demands orders of magnitude higher computational and storage resources. One way to alleviate this difficulty is to focus the computations to informative (salient) regions of the video. In this paper, we propose a novel global spatio-temporal self-similarity measure to score saliency using the ideas of dictionary learning and sparse coding. In contrast to existing methods that use local spatio-temporal feature detectors along with descriptors (such as HOG, HOG3D, and HOF), dictionary learning helps consider the saliency in a global setting (on the entire video) in a computationally efficient way. We consider only a small percentage of the most salient (least self-similar) regions found using our algorithm, over which spatio-temporal descriptors such as HOG and region covariance descriptors are computed. The ensemble of such block descriptors in a bag-of-features framework provides a holistic description of the motion sequence which can be used in a classification setting. Experiments on several benchmark datasets in video based action classification demonstrate that our approach performs competitively to the state of the art.

[1]  Ali Farhadi,et al.  Understanding egocentric activities , 2011, 2011 International Conference on Computer Vision.

[2]  Volker Nannen,et al.  A Short Introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length (MDL) , 2010, ArXiv.

[3]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[4]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[5]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[6]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[7]  Ling Shao,et al.  Feature detector and descriptor evaluation in human action recognition , 2010, CIVR '10.

[8]  S. Frick,et al.  Compressed Sensing , 2014, Computer Vision, A Reference Guide.

[9]  Jorma Rissanen,et al.  MDL Denoising , 2000, IEEE Trans. Inf. Theory.

[10]  Fatih Murat Porikli,et al.  Region Covariance: A Fast Descriptor for Detection and Classification , 2006, ECCV.

[11]  Thomas G. Dietterich,et al.  Automated insect identification through concatenated histograms of local appearance features: feature vector generation and region detection for deformable objects , 2007, 2007 IEEE Workshop on Applications of Computer Vision (WACV '07).

[12]  Somayeh Danafar,et al.  Action Recognition for Surveillance Applications Using Optic Flow and SVM , 2007, ACCV.

[13]  Guillermo Sapiro,et al.  Classification and clustering via dictionary learning with structured incoherence and shared features , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[14]  Jiebo Luo,et al.  Recognizing realistic actions from videos , 2009, CVPR.

[15]  Guillermo Sapiro,et al.  An MDL Framework for Sparse Coding and Dictionary Learning , 2011, IEEE Transactions on Signal Processing.

[16]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[17]  Liqing Zhang,et al.  Dynamic visual attention: searching for coding length increments , 2008, NIPS.

[18]  Guillermo Sapiro,et al.  Sparse Modeling of Human Actions from Motion Imagery , 2012, International Journal of Computer Vision.

[19]  Iasonas Kokkinos,et al.  Discovering discriminative action parts from mid-level video representations , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[21]  Shaogang Gong,et al.  Recognising action as clouds of space-time interest points , 2009, CVPR.

[22]  Janusz Konrad,et al.  Action Recognition in Video by Covariance Matching of Silhouette Tunnels , 2009, 2009 XXII Brazilian Symposium on Computer Graphics and Image Processing.

[23]  Michael Elad,et al.  Efficient Implementation of the K-SVD Algorithm using Batch Orthogonal Matching Pursuit , 2008 .

[24]  N. Ayache,et al.  Log‐Euclidean metrics for fast and simple calculus on diffusion tensors , 2006, Magnetic resonance in medicine.

[25]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[26]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[27]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Axthonv G. Oettinger,et al.  IEEE Transactions on Information Theory , 1998 .

[29]  Thomas B. Moeslund,et al.  Selective spatio-temporal interest points , 2012, Comput. Vis. Image Underst..

[30]  Vladimir Kolmogorov,et al.  Computing visual correspondence with occlusions using graph cuts , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[31]  J. Rissanen,et al.  Modeling By Shortest Data Description* , 1978, Autom..

[32]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[33]  David Windridge,et al.  An evaluation of bags-of-words and spatio-temporal shapes for action recognition , 2011, 2011 IEEE Workshop on Applications of Computer Vision (WACV).

[34]  Michael Brady,et al.  Saliency, Scale and Image Description , 2001, International Journal of Computer Vision.

[35]  Stéphane Mallat,et al.  Matching pursuits with time-frequency dictionaries , 1993, IEEE Trans. Signal Process..

[36]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[37]  Yin Li,et al.  Visual Saliency Based on Conditional Entropy , 2009, ACCV.

[38]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[39]  James M. Rehg,et al.  Decoding Children's Social Behavior , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[41]  Benjamin Bustos,et al.  Harris 3D: a robust extension of the Harris operator for interest point detection on 3D meshes , 2011, The Visual Computer.

[42]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[43]  Mubarak Shah,et al.  Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories , 2011, 2011 International Conference on Computer Vision.

[44]  S. Gong,et al.  Recognising action as clouds of space-time interest points , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[47]  Naoki Saito,et al.  Simultaneous noise suppression and signal compression using a library of orthonormal bases and the minimum-description-length criterion , 1994, Defense, Security, and Sensing.

[48]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[49]  James M. Rehg,et al.  Social interactions: A first-person perspective , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[50]  Guillermo Sapiro,et al.  Online dictionary learning for sparse coding , 2009, ICML '09.

[51]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[52]  Mubarak Shah,et al.  Recognizing human actions using multiple features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Christopher Joseph Pal,et al.  Activity recognition using the velocity histories of tracked keypoints , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[54]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[55]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, CVPR Workshops.

[56]  Mubarak Shah,et al.  Recognizing 50 human action categories of web videos , 2012, Machine Vision and Applications.

[57]  Nikolai K. Vereshchagin,et al.  Kolmogorov's structure functions with an application to the foundations of model selection , 2002, The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings..

[58]  Michael Elad,et al.  Image Denoising Via Sparse and Redundant Representations Over Learned Dictionaries , 2006, IEEE Transactions on Image Processing.

[59]  John K. Tsotsos,et al.  Saliency Based on Information Maximization , 2005, NIPS.

[60]  Dan Schonfeld,et al.  Object Trajectory-Based Activity Classification and Recognition Using Hidden Markov Models , 2007, IEEE Transactions on Image Processing.

[61]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[62]  Rama Chellappa,et al.  Sparse dictionary-based representation and recognition of action attributes , 2011, 2011 International Conference on Computer Vision.

[63]  Janusz Konrad,et al.  Action Recognition Using Sparse Representation on Covariance Manifolds of Optical Flow , 2010, 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance.

[64]  Andrew Gilbert,et al.  Action Recognition Using Mined Hierarchical Compound Features , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[65]  Luc Van Gool,et al.  Hough Forests for Object Detection, Tracking, and Action Recognition , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[66]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..