Scene Flow to Action Map: A New Representation for RGB-D Based Action Recognition with Convolutional Neural Networks

Scene flow describes the motion of 3D objects in real world and potentially could be the basis of a good feature for 3D action recognition. However, its use for action recognition, especially in the context of convolutional neural networks (ConvNets), has not been previously studied. In this paper, we propose the extraction and use of scene flow for action recognition from RGB-D data. Previous works have considered the depth and RGB modalities as separate channels and extract features for later fusion. We take a different approach and consider the modalities as one entity, thus allowing feature extraction for action recognition at the beginning. Two key questions about the use of scene flow for action recognition are addressed: how to organize the scene flow vectors and how to represent the long term dynamics of videos based on scene flow. In order to calculate the scene flow correctly on the available datasets, we propose an effective self-calibration method to align the RGB and depth data spatially without knowledge of the camera parameters. Based on the scene flow vectors, we propose a new representation, namely, Scene Flow to Action Map (SFAM), that describes several long term spatio-temporal dynamics for action recognition. We adopt a channel transform kernel to transform the scene flow vectors to an optimal color space analogous to RGB. This transformation takes better advantage of the trained ConvNets models over ImageNet. Experimental results indicate that this new representation can surpass the performance of state-of-the-art methods on two large public datasets.

[1]  Yun Fu,et al.  Bilinear heterogeneous information machine for RGB-D action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Yun Fu,et al.  Low-Rank Tensor Subspace Learning for RGB-D Action Recognition , 2016, IEEE Transactions on Image Processing.

[3]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Pichao Wang,et al.  Large-scale Isolated Gesture Recognition using Convolutional Neural Networks , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[5]  Konrad Schindler,et al.  Piecewise Rigid Scene Flow , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  M. Fukushima,et al.  Erratum to Levenberg-Marquardt methods with strong local convergence properties for solving nonlinear equations with convex constraints , 2005 .

[7]  Mubarak Shah,et al.  Actions sketch: a novel action representation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[8]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[9]  Andrew W. Fitzgibbon,et al.  SphereFlow: 6 DoF Scene Flow from RGB-D Pairs , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Limin Wang,et al.  Computer Vision and Image Understanding Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice , 2022 .

[11]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Hong Wei,et al.  ReadingAct RGB-D action dataset and human action recognition from local features , 2014, Pattern Recognit. Lett..

[13]  Daniel Cremers,et al.  Stereoscopic Scene Flow Computation for 3D Motion Understanding , 2011, International Journal of Computer Vision.

[14]  Andrea Vedaldi,et al.  MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[15]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[16]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Jun Wan,et al.  Explore Efficient Local Features from RGB-D Data for One-Shot Learning Gesture Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Wanqing Li,et al.  Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[19]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[20]  Xiaodong Yang,et al.  Recognizing actions using depth motion maps-based histograms of oriented gradients , 2012, ACM Multimedia.

[21]  Jing Zhang,et al.  ConvNets-Based Action Recognition from Depth Maps through Virtual Cameras and Pseudocoloring , 2015, ACM Multimedia.

[22]  Jian-Huang Lai,et al.  Jointly Learning Heterogeneous Features for RGB-D Activity Recognition , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Jing Zhang,et al.  RGB-D-based action recognition datasets: A survey , 2016, Pattern Recognit..

[24]  Jing Zhang,et al.  Action Recognition From Depth Maps Using Deep Convolutional Neural Networks , 2016, IEEE Transactions on Human-Machine Systems.

[25]  Richard Bowden,et al.  Scene Particles: Unregularized Particle-Based Scene Flow Estimation , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Pichao Wang,et al.  Mining Mid-Level Features for Action Recognition Based on Effective Skeleton Representation , 2014, 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[27]  Tinne Tuytelaars,et al.  Rank Pooling for Action Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Bhiksha Raj,et al.  Beyond Gaussian Pyramid: Multi-skip Feature Stacking for action recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Yael Moses,et al.  Multi-view Scene Flow Estimation: A View Centered Variational Approach , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[30]  Frederic Devernay,et al.  A Variational Method for Scene Flow Estimation from Stereo Sequences , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[31]  Pichao Wang,et al.  Skeleton Optical Spectra-Based Action Recognition Using Convolutional Neural Networks , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[32]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[33]  Sergio Escalera,et al.  ChaLearn Joint Contest on Multimedia Challenges Beyond Visual Analysis: An overview , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[34]  James W. Davis,et al.  An appearance-based representation of action , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[35]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[36]  Konrad Schindler,et al.  3D scene flow estimation with a rigid motion prior , 2011, 2011 International Conference on Computer Vision.

[37]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Janis Fehr,et al.  Computing Range Flow from Multi-modal Kinect Data , 2011, ISVC.

[39]  Mohan S. Kankanhalli,et al.  Benchmarking a Multimodal and Multiview and Interactive Dataset for Human Action Recognition , 2017, IEEE Transactions on Cybernetics.

[40]  Yong Wang,et al.  Machine Vision and Applications , 2013 .

[41]  Pichao Wang,et al.  Joint Distance Maps Based Action Recognition With Convolutional Neural Networks , 2017, IEEE Signal Processing Letters.

[42]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[43]  Sergio Escalera,et al.  ChaLearn Looking at People RGB-D Isolated and Continuous Datasets for Gesture Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[44]  Edmond Boyer,et al.  Scene Flow from Depth and Color Images , 2011, BMVC.

[45]  Marwan Torki,et al.  Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations , 2013, IJCAI.

[46]  Cewu Lu,et al.  Range-Sample Depth Feature for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Bernhard Schölkopf,et al.  A tutorial on support vector regression , 2004, Stat. Comput..

[48]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[49]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[50]  Ling Shao,et al.  Structure-Preserving Binary Representations for RGB-D Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Xiaodong Yang,et al.  Super Normal Vector for Activity Recognition Using Depth Sequences , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[52]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Bingbing Ni,et al.  Pose Adaptive Motion Feature Pooling for Human Action Analysis , 2014, International Journal of Computer Vision.

[56]  Thomas Brox,et al.  Dense Semi-rigid Scene Flow Estimation from RGBD Images , 2014, ECCV.

[57]  Isabelle Guyon,et al.  The ChaLearn gesture dataset (CGD 2011) , 2014, Machine Vision and Applications.

[58]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[59]  Daniel Cremers,et al.  A primal-dual framework for real-time dense RGB-D scene flow , 2015, 2015 IEEE International Conference on Robotics and Automation (ICRA).

[60]  H. Zhang,et al.  Multi-perspective and multi-modality joint representation and recognition model for 3D action recognition , 2015, Neurocomputing.

[61]  Kien A. Hua,et al.  Temporal Order-Preserving Dynamic Quantization for Human Action Recognition from Multimodal Sensor Streams , 2015, ICMR.

[62]  Andrea Vedaldi,et al.  Dynamic Image Networks for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Richard Bowden,et al.  Kinecting the dots: Particle based scene flow from depth sensors , 2011, 2011 International Conference on Computer Vision.

[64]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[65]  Wei Li,et al.  3D SMoSIFT: three-dimensional sparse motion scale invariant feature transform for activity recognition from RGB-D videos , 2014, J. Electronic Imaging.

[66]  Zicheng Liu,et al.  HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[67]  Guo-Jun Qi,et al.  Differential Recurrent Neural Networks for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[68]  Pichao Wang,et al.  Large-scale Continuous Gesture Recognition Using Convolutional Neural Networks , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[69]  Mohammad H. Mahoor,et al.  Human activity recognition using multi-features and multiple kernel learning , 2014, Pattern Recognit..

[70]  Xiaodong Yang,et al.  EigenJoints-based action recognition using Naïve-Bayes-Nearest-Neighbor , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[71]  Bir Bhanu,et al.  Individual recognition using gait energy image , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[72]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[73]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[74]  Xiaogang Wang,et al.  Deep Learning Face Representation from Predicting 10,000 Classes , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[75]  Pichao Wang,et al.  Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks , 2016, ACM Multimedia.

[76]  Deqing Sun,et al.  Layered RGBD scene flow estimation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[77]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[78]  Gang Wang,et al.  Global Context-Aware Attention LSTM Networks for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[79]  Andreas Geiger,et al.  Object scene flow for autonomous vehicles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[80]  Marwan Torki,et al.  Histogram of Oriented Displacements (HOD): Describing Trajectories of Human Joints for Action Recognition , 2013, IJCAI.

[81]  Takeo Kanade,et al.  Three-dimensional scene flow , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[82]  Jake K. Aggarwal,et al.  View invariant human action recognition using histograms of 3D joints , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[83]  Li Ma,et al.  Coupled hidden conditional random fields for RGB-D human action recognition , 2015, Signal Process..

[84]  Qiang Ji,et al.  A generative restricted Boltzmann machine based method for high-dimensional motion data modeling , 2015, Comput. Vis. Image Underst..

[85]  Ye Zhang,et al.  On 3D scene flow and structure estimation , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[86]  Silvio Savarese,et al.  Watch-n-patch: Unsupervised understanding of actions and relations , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[87]  Ajmal S. Mian,et al.  Learning a non-linear knowledge transfer model for cross-view action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).