Information Fusion for Human Action Recognition via Biset/Multiset Globality Locality Preserving Canonical Correlation Analysis

In this paper, we study human action recognition in which each action is captured by multiple sensors and represented by multiple sets of features (multisets). We propose two novel techniques for fusing the information in these multisets. The first, biset globality locality preserving canonical correlation analysis (BGLPCCA), learns a common feature subspace between two sets. The second, multiset globality locality preserving canonical correlation analysis (MGLPCCA), handles three or more sets. Both BGLPCCA and MGLPCCA learn a low-dimensional common subspace that preserves the local and global structures of the data samples. In addition, we present two novel descriptors for depth and skeleton data. We then propose a new human action recognition framework that employs BGLPCCA or MGLPCCA to learn the shared subspace from multiple sets of features, including skeleton, depth, and optical flow. Extensive experiments on five publicly available datasets (MSR Action3D, UTD multimodal human action dataset, multimodal action database, Kinect activity recognition dataset, and SBU Kinect interaction dataset) demonstrate the effectiveness of the proposed framework.
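To make the idea of a "common feature subspace between two sets" concrete, the following is a minimal sketch of classical two-set canonical correlation analysis, the baseline that BGLPCCA extends. It is not the proposed method: the globality and locality preserving terms are not specified in this abstract and are omitted here. The function name `cca_fit`, the ridge parameter `reg`, the synthetic data, and the choice of concatenation as the fusion step are all assumptions made for illustration.

```python
import numpy as np


def cca_fit(X, Y, dim, reg=1e-6):
    """Classical two-set CCA (illustrative baseline, not BGLPCCA).

    X: (n, p) features from one sensor, Y: (n, q) features from another.
    Returns projections Wx (p, dim), Wy (q, dim) and canonical correlations.
    `reg` is a small ridge term (an assumption) for numerical stability.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Within-set and between-set covariance matrices.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    # Whiten each set with its Cholesky factor: Sxx = Lx Lx^T, Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)

    # T = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical correlations.
    T = np.linalg.solve(Lx, Sxy)
    T = np.linalg.solve(Ly, T.T).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)

    # Map the singular vectors back to the original feature spaces.
    Wx = np.linalg.solve(Lx.T, U[:, :dim])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :dim])
    return Wx, Wy, s[:dim]


# Synthetic stand-in for two sensor modalities sharing a 2-D latent signal.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 2))
X = z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(n, 6))
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(n, 4))

Wx, Wy, corrs = cca_fit(X, Y, dim=2)

# Project both sets into the shared subspace and fuse by concatenation
# (one possible fusion strategy, chosen here only for illustration).
Px = (X - X.mean(axis=0)) @ Wx
Py = (Y - Y.mean(axis=0)) @ Wy
fused = np.hstack([Px, Py])
```

The fused features could then be passed to any classifier. A multiset variant for three or more sets would replace the single cross-covariance with a sum over all pairs of sets, which this sketch does not attempt.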
