Simple very deep convolutional network for robust hand pose regression from a single depth image

Abstract We propose a novel approach for articulated hand pose estimation from a single depth image using a very deep convolutional network. For the first, a very deep network structure is designed to directly maps a single depth image to its corresponding 3D hand joint locations. This approach eliminates the necessity of hand-crafted intermediate features and sophisticated post-processing stages for robust and accurate hand pose estimation. We use Batch Normalization to accelerate training and prevent the objective function from getting stuck in poor local minima. We introduce a low-dimensional embedding forcing the network to learn the inherent constraints of hand joints, which helps to reduce the cost of reconstructing 3D hand poses from high-dimension feature space. We discuss the effect of the convolutional network depth on its accuracy under the hand pose regression setting. Quantitative assessments on two challenging datasets show that our proposed method gets competitive results to state-of-the-art approaches in terms of accuracy. Moreover, qualitative results also show that our proposed method is robust to some difficult hand poses.

[1]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Stan Sclaroff,et al.  Estimating 3D hand pose from a cluttered image , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[3]  Jian Sun,et al.  Cascaded hand pose regression , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Andrew W. Fitzgibbon,et al.  Accurate, Robust, and Flexible Real-time Hand Tracking , 2015, CHI.

[5]  Vincent Lepetit,et al.  Training a Feedback Loop for Hand Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Ken Perlin,et al.  Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks , 2014, ACM Trans. Graph..

[7]  Tae-Kyun Kim,et al.  Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Andrew W. Fitzgibbon,et al.  Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences , 2016, ACM Trans. Graph..

[9]  Lale Akarun,et al.  Real time hand pose estimation using depth sensors , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[10]  Christian Wolf,et al.  Hand pose estimation through semi-supervised and weakly-supervised learning , 2015, Comput. Vis. Image Underst..

[11]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[12]  Karthik Ramani,et al.  DeepHand: Robust Hand Pose Estimation by Completing a Matrix Imputed with Deep Features , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Andrea Tagliasacchi,et al.  Robust Articulated-ICP for Real-Time Hand Tracking , 2015 .

[14]  Lale Akarun,et al.  Hand Pose Estimation and Hand Shape Classification Using Multi-layered Randomized Decision Forests , 2012, ECCV.

[15]  Andrew W. Fitzgibbon,et al.  The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Li Cheng,et al.  Efficient Hand Pose Estimation from a Single Depth Image , 2013, 2013 IEEE International Conference on Computer Vision.

[17]  Nicholay Topin,et al.  Deep Convolutional Neural Network Design Patterns , 2016, ArXiv.

[18]  Daniel Thalmann,et al.  Robust 3D Hand Pose Estimation in Single Depth Images: From Single-View CNN to Multi-View CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Helge J. Ritter,et al.  Real-time hand tracking using synergistic inverse kinematics , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[20]  Mircea Nicolescu,et al.  Vision-based hand pose estimation: A review , 2007, Comput. Vis. Image Underst..

[21]  Yi Yang,et al.  Depth-Based Hand Pose Estimation: Methods, Data, and Challenges , 2015, International Journal of Computer Vision.

[22]  Antonis A. Argyros,et al.  Efficient model-based 3D tracking of hand articulations using Kinect , 2011, BMVC.

[23]  Qi Ye,et al.  Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation , 2016, ECCV.

[24]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[25]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[26]  Marc Pollefeys,et al.  Capturing Hands in Action Using Discriminative Salient Points and Physics Simulation , 2015, International Journal of Computer Vision.

[27]  Li Cheng,et al.  Estimate Hand Poses Efficiently from Single Depth Images , 2015, International Journal of Computer Vision.

[28]  Jovan Popović,et al.  Real-time hand-tracking with a color glove , 2009, SIGGRAPH 2009.

[29]  Luc Van Gool,et al.  Motion Capture of Hands in Action Using Discriminative Salient Points , 2012, ECCV.

[30]  Chen Qian,et al.  Realtime and Robust Hand Tracking from Depth , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.