Coupled hidden conditional random fields for RGB-D human action recognition

This paper proposes a human action recognition method based on a coupled hidden conditional random fields (cHCRF) model that fuses RGB and depth sequential information. The cHCRF model extends the standard hidden-state conditional random fields model, which handles only a single chain-structured sequential observation, to multiple chain-structured sequential observations, namely synchronized sequence data captured in multiple modalities. For model formulation, we propose a specific graph structure for the interaction among multiple modalities and design the corresponding potential functions. We then propose model learning and inference methods that discover the latent correlation between RGB and depth data and model the temporal context within each individual modality. Extensive experiments show that the proposed model improves human action recognition performance by taking advantage of the complementary characteristics of the RGB and depth modalities.

Highlights

• We propose cHCRF to learn sequence-specific and sequence-shared temporal structure.
• We contribute a novel RGB-D human action dataset containing 1200 samples.
• Experiments on 3 popular datasets show the superiority of the proposed method.
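
For orientation, the standard hidden conditional random field that the abstract takes as its starting point is usually written as the conditional probability of a label y given one observation sequence x, marginalized over a chain of hidden states h. The second expression below is only a sketch of how two synchronized chains could enter the same form. The symbols x^r, x^d, h^r, h^d and the single joint potential Ψ are illustrative assumptions, not the graph structure or potential functions defined in the paper.

```latex
% Standard HCRF: label y, observations x = (x_1, ..., x_T),
% hidden states h = (h_1, ..., h_T), parameters \theta.
P(y \mid \mathbf{x}; \theta)
  = \frac{\sum_{\mathbf{h}} \exp \Psi(y, \mathbf{h}, \mathbf{x}; \theta)}
         {\sum_{y'} \sum_{\mathbf{h}} \exp \Psi(y', \mathbf{h}, \mathbf{x}; \theta)}

% Assumed two-chain sketch: one hidden chain per modality
% (r = RGB, d = depth), both marginalized out.
P(y \mid \mathbf{x}^{r}, \mathbf{x}^{d}; \theta)
  = \frac{\sum_{\mathbf{h}^{r}, \mathbf{h}^{d}}
          \exp \Psi(y, \mathbf{h}^{r}, \mathbf{h}^{d}, \mathbf{x}^{r}, \mathbf{x}^{d}; \theta)}
         {\sum_{y'} \sum_{\mathbf{h}^{r}, \mathbf{h}^{d}}
          \exp \Psi(y', \mathbf{h}^{r}, \mathbf{h}^{d}, \mathbf{x}^{r}, \mathbf{x}^{d}; \theta)}
```

In a form like this, any term of Ψ that links h^r and h^d would carry the cross-modal correlation, while terms linking consecutive hidden states within one chain would carry the temporal context of that modality.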
