Single/multi-view human action recognition via regularized multi-task learning

Abstract This paper proposes a unified single/multi-view human action recognition method via regularized multi-task learning. First, we propose the pyramid partwise bag of words (PPBoW) representation which implicitly encodes both local visual characteristics and human body structure. Furthermore, we formulate the task of single/multi-view human action recognition into a part-induced multi-task learning problem penalized by graph structure and sparsity to discover the latent correlation among multiple views and body parts and consequently boost the performances. The experiment shows that this method can significantly improve performance over the standard BoW+SVM method. Moreover, the proposed method can achieve competing performance simply with low dimensional PPBoW representation against the state-of-the-art methods for human action recognition on KTH and MV-TJU, a new multi-view action dataset with RGB, depth and skeleton data prepared by our group.

[1]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[2]  Sebastian Nowozin,et al.  Discriminative Subsequence Mining for Action Classification , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[3]  Ling Shao,et al.  Embedding Motion and Structure Features for Action Recognition , 2013, IEEE Transactions on Circuits and Systems for Video Technology.

[4]  Mubarak Shah,et al.  Actions sketch: a novel action representation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[5]  Ramakant Nevatia,et al.  Single View Human Action Recognition using Key Pose Matching and Viterbi Path Searching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Min Chen,et al.  Video Semantic Event/Concept Detection Using a Subspace-Based Multimedia Data Mining Framework , 2008, IEEE Transactions on Multimedia.

[7]  Shuicheng Yan,et al.  Robust Image Analysis With Sparse Representation on Quantized Visual Features , 2013, IEEE Transactions on Image Processing.

[8]  Rui Li,et al.  Simultaneous Learning of Nonlinear Manifold and Dynamical Models for High-dimensional Time Series , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[9]  Xuelong Li,et al.  Modality Mixture Projections for Semantic Video Event Detection , 2008, IEEE Transactions on Circuits and Systems for Video Technology.

[10]  Erkki Oja,et al.  Projective Nonnegative Matrix Factorization for Image Compression and Feature Extraction , 2005, SCIA.

[11]  Yuting Su,et al.  Multiple/Single-View Human Action Recognition via Part-Induced Multitask Structural Learning , 2015, IEEE Transactions on Cybernetics.

[12]  Qi Tian,et al.  Less is More: Efficient 3-D Object Retrieval With Query View Selection , 2011, IEEE Transactions on Multimedia.

[13]  Yue Gao,et al.  3-D Object Retrieval and Recognition With Hypergraph Analysis , 2012, IEEE Transactions on Image Processing.

[14]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[15]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Yuting Su,et al.  Partwise bag-of-words-based multi-task learning for human action recognition , 2013 .

[17]  Adriana Kovashka,et al.  Learning a hierarchy of discriminative space-time neighborhood features for human action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[19]  Thomas Serre,et al.  A Biologically Inspired System for Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[20]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[21]  Xiao Liu,et al.  Probabilistic Graphlet Transfer for Photo Cropping , 2013, IEEE Transactions on Image Processing.

[22]  Alberto Del Bimbo,et al.  Effective Codebooks for Human Action Representation and Classification in Unconstrained Videos , 2012, IEEE Transactions on Multimedia.

[23]  Wen Gao,et al.  Mining Compact Bag-of-Patterns for Low Bit Rate Mobile Visual Search , 2014, IEEE Transactions on Image Processing.

[24]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Greg Mori,et al.  Action recognition by learning mid-level motion features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Yue Gao,et al.  View-Based Discriminative Probabilistic Modeling for 3D Object Retrieval and Recognition , 2013, IEEE Transactions on Image Processing.

[27]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[28]  Xuelong Li,et al.  Fusion of Multichannel Local and Global Structural Cues for Photo Aesthetics Evaluation , 2014, IEEE Transactions on Image Processing.

[29]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[30]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[31]  Yue Gao,et al.  Cross-View Down/Up-Sampling Method for Multiview Depth Video Coding , 2012, IEEE Signal Processing Letters.

[32]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[33]  Yue Gao,et al.  3D model comparison using spatial structure circular descriptor , 2010, Pattern Recognit..

[34]  Mario Cannataro,et al.  Protein-to-protein interactions: Technologies, databases, and algorithms , 2010, CSUR.

[35]  Dacheng Tao,et al.  Slow Feature Analysis for Human Action Recognition , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Yanbing Xue,et al.  Human action recognition based on sparse representation induced by L1/L2 regulations , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[37]  Weifeng Liu,et al.  Multiview Hessian Regularization for Image Annotation , 2013, IEEE Transactions on Image Processing.

[38]  Meng Wang,et al.  Unified Video Annotation via Multigraph Learning , 2009, IEEE Transactions on Circuits and Systems for Video Technology.

[39]  Li Ma,et al.  Coupled hidden conditional random fields for RGB-D human action recognition , 2015, Signal Process..

[40]  Wen Gao,et al.  Location Discriminative Vocabulary Coding for Mobile Landmark Search , 2012, International Journal of Computer Vision.

[41]  Mubarak Shah,et al.  View-Invariant Representation and Recognition of Actions , 2002, International Journal of Computer Vision.

[42]  Yongdong Zhang,et al.  Efficient Parallel Framework for H.264/AVC Deblocking Filter on Many-Core Platform , 2012, IEEE Transactions on Multimedia.

[43]  Martial Hebert,et al.  Efficient visual event detection using volumetric features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[44]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[45]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[46]  Bingbing Ni,et al.  RGBD-HuDaAct: A color-depth video database for human daily activity recognition , 2011, ICCV Workshops.

[47]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[49]  Silvio Savarese,et al.  Cross-view action recognition via view knowledge transfer , 2011, CVPR 2011.

[50]  Zicheng Liu,et al.  Hierarchical Filtered Motion for Action Recognition in Crowded Videos , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[51]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Yurii Nesterov,et al.  Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.

[53]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Qi Tian,et al.  Task-Dependent Visual-Codebook Compression , 2012, IEEE Transactions on Image Processing.

[55]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[56]  Min-Chun Hu,et al.  Human action recognition and retrieval using sole depth information , 2012, ACM Multimedia.

[57]  Jiebo Luo,et al.  Recognizing realistic actions from videos , 2009, CVPR.

[58]  Xindong Wu,et al.  3-D Object Retrieval With Hausdorff Distance Learning , 2014, IEEE Transactions on Industrial Electronics.

[59]  Yong Luo,et al.  Manifold Regularized Multitask Learning for Semi-Supervised Multilabel Image Classification , 2013, IEEE Transactions on Image Processing.

[60]  Ivan Laptev,et al.  Local Descriptors for Spatio-temporal Recognition , 2004, SCVMA.

[61]  Yue Gao,et al.  View-based 3D object retrieval by bipartite graph matching , 2012, ACM Multimedia.

[62]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[63]  Yue Gao,et al.  Camera Constraint-Free View-Based 3-D Object Retrieval , 2012, IEEE Transactions on Image Processing.

[64]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[65]  Jake K. Aggarwal,et al.  Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[66]  Ivan Laptev,et al.  Local spatio-temporal image features for motion interpretation , 2004 .

[67]  Guillermo Sapiro,et al.  Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations , 2009, NIPS.

[68]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[69]  Jake K. Aggarwal,et al.  Human Motion Analysis: A Review , 1999, Comput. Vis. Image Underst..

[70]  Ali Farhadi,et al.  Learning to Recognize Activities from the Wrong View Point , 2008, ECCV.

[71]  Mubarak Shah,et al.  Discovering Motion Primitives for Unsupervised Grouping and One-Shot Learning of Human Actions, Gestures, and Expressions , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[72]  Yue Gao,et al.  Remote Dynamic Three-Dimensional Scene Reconstruction , 2013, PloS one.