Crossing Nets: Dual Generative Models with a Shared Latent Space for Hand Pose Estimation

State-of-the-art methods for 3D hand pose estimation from depth images require large amounts of annotated training data. We propose to model the statistical relationship between 3D hand poses and the corresponding depth images using two deep generative models with a shared latent space. By design, our architecture allows for learning from unlabeled image data in a semi-supervised manner. Assuming a one-to-one mapping between a pose and a depth map, any given point in the shared latent space can be projected into both a hand pose and a corresponding depth map. Regressing the hand pose can then be done by learning a discriminator that estimates the posterior of the latent pose given a depth map. To improve generalization and to better exploit unlabeled depth maps, we jointly train a generator and a discriminator. At each iteration, the generator is updated with the gradient back-propagated from the discriminator so that it synthesizes realistic depth maps of the articulated hand, while the discriminator benefits from a training set augmented with synthesized and unlabeled samples. The proposed discriminator network architecture is highly efficient and runs at 90 FPS on the CPU, with accuracy comparable to or better than the state of the art on three publicly available benchmarks.
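
As a rough illustration of the data flow described above, the following Python sketch shows how a single point in a shared latent space could be projected into both a pose and a depth map, and how a pose could be regressed from a depth map by first estimating the latent point. It is a minimal sketch under stated assumptions, not the authors' implementation: untrained linear maps stand in for the deep networks, the adversarial training loop is omitted, and all names and sizes (`CrossingSketch`, `POSE_DIM`, `LATENT_DIM`, `DEPTH_SHAPE`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only for illustration.
POSE_DIM = 42            # e.g. 14 joints x 3 coordinates
LATENT_DIM = 8           # size of the shared latent space
DEPTH_SHAPE = (32, 32)   # depth-map resolution


def linear(in_dim, out_dim):
    """Return a randomly initialised (weights, bias) pair."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


class CrossingSketch:
    """Untrained stand-in for two generative heads sharing one latent space."""

    def __init__(self):
        n_pixels = DEPTH_SHAPE[0] * DEPTH_SHAPE[1]
        self.pose_dec = linear(LATENT_DIM, POSE_DIM)    # latent -> pose
        self.depth_gen = linear(LATENT_DIM, n_pixels)   # latent -> depth map
        self.posterior = linear(n_pixels, LATENT_DIM)   # depth map -> latent

    def decode_pose(self, z):
        w, b = self.pose_dec
        return z @ w + b

    def generate_depth(self, z):
        w, b = self.depth_gen
        return np.tanh(z @ w + b).reshape(-1, *DEPTH_SHAPE)

    def infer_latent(self, depth):
        # Plays the role of the posterior estimate of the latent pose.
        w, b = self.posterior
        return depth.reshape(len(depth), -1) @ w + b

    def estimate_pose(self, depth):
        # Pose regression: depth map -> latent point -> pose.
        return self.decode_pose(self.infer_latent(depth))


model = CrossingSketch()
z = rng.normal(size=(4, LATENT_DIM))       # four points in the latent space
poses = model.decode_pose(z)               # shape (4, 42)
depths = model.generate_depth(z)           # shape (4, 32, 32)
estimated = model.estimate_pose(depths)    # shape (4, 42)
```

In a trained system the three maps would be deep networks, and the depth generator and the posterior estimator would be updated jointly so that synthesized and unlabeled depth maps both contribute to training.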
