3D Convolutional Neural Networks for Efficient and Robust Hand Pose Estimation from Single Depth Images

We propose a simple, yet effective approach for real-time hand pose estimation from single depth images using three-dimensional Convolutional Neural Networks (3D CNNs). Image based features extracted by 2D CNNs are not directly suitable for 3D hand pose estimation due to the lack of 3D spatial information. Our proposed 3D CNN taking a 3D volumetric representation of the hand depth image as input can capture the 3D spatial structure of the input and accurately regress full 3D hand pose in a single pass. In order to make the 3D CNN robust to variations in hand sizes and global orientations, we perform 3D data augmentation on the training data. Experiments show that our proposed 3D CNN based approach outperforms state-of-the-art methods on two challenging hand pose datasets, and is very efficient as our implementation runs at over 215 fps on a standard computer with a single GPU.

[1]  Tae-Kyun Kim,et al.  Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Andrew W. Fitzgibbon,et al.  Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences , 2016, ACM Trans. Graph..

[3]  Pavlo Molchanov,et al.  Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Chen Qian,et al.  Realtime and Robust Hand Tracking from Depth , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Karthik Ramani,et al.  DeepHand: Robust Hand Pose Estimation by Completing a Matrix Imputed with Deep Features , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Sebastian Scherer,et al.  3D Convolutional Neural Networks for landing zone detection from LiDAR , 2015, 2015 IEEE International Conference on Robotics and Automation (ICRA).

[7]  Jianxiong Xiao,et al.  3D ShapeNets: A deep representation for volumetric shapes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Andrew W. Fitzgibbon,et al.  Learning an efficient model of hand shape variation from depth images , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Antti Oulasvirta,et al.  Real-Time Joint Tracking of a Hand Manipulating an Object from RGB-D Input , 2016, ECCV.

[10]  Vincent Lepetit,et al.  Training a Feedback Loop for Hand Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[11]  Andrea Tagliasacchi,et al.  Robust Articulated-ICP for Real-Time Hand Tracking , 2015 .

[12]  Jianxiong Xiao,et al.  Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[14]  Marc Pollefeys,et al.  Capturing Hands in Action Using Discriminative Salient Points and Physics Simulation , 2015, International Journal of Computer Vision.

[15]  Andrew W. Fitzgibbon,et al.  The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Tae-Kyun Kim,et al.  Opening the Black Box: Hierarchical Sampling Optimization for Estimating Human Hand Pose , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Fei-Fei Li,et al.  Towards Viewpoint Invariant 3D Human Pose Estimation , 2016, ECCV.

[18]  Daniel Thalmann,et al.  Parsing the Hand in Depth Images , 2014, IEEE Transactions on Multimedia.

[19]  Leonidas J. Guibas,et al.  Volumetric and Multi-view CNNs for Object Classification on 3D Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Karthik Ramani,et al.  A Collaborative Filtering Approach to Real-Time Hand Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Jian Sun,et al.  Cascaded hand pose regression , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Sebastian Scherer,et al.  VoxNet: A 3D Convolutional Neural Network for real-time object recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[23]  Andrew W. Fitzgibbon,et al.  Accurate, Robust, and Flexible Real-time Hand Tracking , 2015, CHI.

[24]  Luc Van Gool,et al.  Direction matters: hand pose estimation from local surface normals , 2016, ArXiv.

[25]  Andrew W. Fitzgibbon,et al.  KinectFusion: Real-time dense surface mapping and tracking , 2011, 2011 10th IEEE International Symposium on Mixed and Augmented Reality.

[26]  Li Cheng,et al.  Efficient Hand Pose Estimation from a Single Depth Image , 2013, 2013 IEEE International Conference on Computer Vision.

[27]  Ken Perlin,et al.  Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks , 2014, ACM Trans. Graph..

[28]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Lale Akarun,et al.  Hand Pose Estimation and Hand Shape Classification Using Multi-layered Randomized Decision Forests , 2012, ECCV.

[30]  Yichen Wei,et al.  Model-Based Deep Hand Pose Estimation , 2016, IJCAI.

[31]  Vincent Lepetit,et al.  Efficiently Creating 3D Training Data for Fine Hand Pose Estimation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Daniel Thalmann,et al.  Robust 3D Hand Pose Estimation in Single Depth Images: From Single-View CNN to Multi-View CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Toby Sharp,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR.

[34]  Clément Farabet,et al.  Torch7: A Matlab-like Environment for Machine Learning , 2011, NIPS 2011.

[35]  Vincent Lepetit,et al.  Hands Deep in Deep Learning for Hand Pose Estimation , 2015, ArXiv.

[36]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Daniel Thalmann,et al.  Resolving Ambiguous Hand Pose Predictions by Exploiting Part Correlations , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[38]  Antonis A. Argyros,et al.  Efficient model-based 3D tracking of hand articulations using Kinect , 2011, BMVC.

[39]  Qi Ye,et al.  Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation , 2016, ECCV.

[40]  Niloy J. Mitra,et al.  Learning Semantic Deformation Flows with 3D Convolutional Networks , 2016, ECCV.

[41]  Antti Oulasvirta,et al.  Fast and robust hand tracking using detection-guided optimization , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Luc Van Gool,et al.  Motion Capture of Hands in Action Using Discriminative Salient Points , 2012, ECCV.