Learning Human Pose Estimation Features with Convolutional Networks

Abstract: This paper introduces a new architecture for human pose estimation using a multi- layer convolutional network architecture and a modified learning technique that learns low-level features and higher-level weak spatial models. Unconstrained human pose estimation is one of the hardest problems in computer vision, and our new architecture and learning schema shows significant improvement over the current state-of-the-art results. The main contribution of this paper is showing, for the first time, that a specific variation of deep learning is able to outperform all existing traditional architectures on this task. The paper also discusses several lessons learned while researching alternatives, most notably, that it is possible to learn strong low-level feature detectors on features that might even just cover a few pixels in the image. Higher-level spatial models improve somewhat the overall result, but to a much lesser extent then expected. Many researchers previously argued that the kinematic structure and top-down information is crucial for this domain, but with our purely bottom up, and weak spatial model, we could improve other more complicated architectures that currently produce the best results. This mirrors what many other researchers, like those in the speech recognition, object recognition, and other domains have experienced.

[1]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[2]  David A. Forsyth,et al.  Strike a pose: tracking people by finding stylized poses , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[3]  John C. Platt,et al.  A Convolutional Neural Network Hand Tracker , 1994, NIPS.

[4]  Christoph Bregler,et al.  Pose-Sensitive Embedding by Nonlinear NCA Regression , 2010, NIPS.

[5]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[7]  George Kollios,et al.  BoostMap: A method for efficient approximate similarity rankings , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[8]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[9]  Ali Farhadi,et al.  Transfer Learning in Sign language , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Andrew Zisserman,et al.  Pose search: Retrieving people using their pose , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Mircea Nicolescu,et al.  Vision-based hand pose estimation: A review , 2007, Comput. Vis. Image Underst..

[12]  Yann LeCun,et al.  Synergistic Face Detection and Pose Estimation with Energy-Based Models , 2004, J. Mach. Learn. Res..

[13]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[14]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Trevor Darrell,et al.  Inferring 3D structure with a statistical image-based shape model , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[16]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[17]  Michael J. Black,et al.  Coregistration: Simultaneous Alignment and Modeling of Articulated 3D Shape , 2012, ECCV.

[18]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[19]  Bernt Schiele,et al.  Monocular 3D pose estimation and tracking by detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[20]  Toby Sharp,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR.

[21]  Cordelia Schmid,et al.  Estimating Human Pose with Flowing Puppets , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Ben Taskar,et al.  Adaptive pose priors for pictorial structures , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[23]  Trevor Darrell,et al.  Fast pose estimation with parameter-sensitive hashing , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[24]  David J. Fleet,et al.  Dynamical binary latent variable models for 3D human pose tracking , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[25]  Jovan Popovic,et al.  Real-time hand-tracking with a color glove , 2009, SIGGRAPH '09.

[26]  Alex Pentland,et al.  Pfinder: Real-Time Tracking of the Human Body , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[27]  Geoffrey E. Hinton,et al.  On the importance of initialization and momentum in deep learning , 2013, ICML.

[28]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[29]  Luc Van Gool,et al.  Efficient Non-Maximum Suppression , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[30]  Ankur Agarwal,et al.  Recovering 3D human pose from monocular images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Neural Networks , 2013 .

[33]  Joseph F. Murray,et al.  Convolutional Networks Can Learn to Generate Affinity Graphs for Image Segmentation , 2010, Neural Computation.

[34]  Yann LeCun,et al.  Scene parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers , 2012, ICML.

[35]  Bernt Schiele,et al.  Pictorial structures revisited: People detection and articulated pose estimation , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[37]  Nicolas Pinto,et al.  Why is Real-World Visual Object Recognition Hard? , 2008, PLoS Comput. Biol..

[38]  Sebastian Thrun,et al.  SCAPE: shape completion and animation of people , 2005, SIGGRAPH '05.

[39]  Pascal Fua,et al.  Are spatial and global constraints really necessary for segmentation? , 2011, 2011 International Conference on Computer Vision.

[40]  BlakeAndrew,et al.  Real-time human pose recognition in parts from single depth images , 2013 .

[41]  Jitendra Malik,et al.  Estimating Human Body Configurations Using Shape Context Matching , 2002, ECCV.

[42]  Andrew Zisserman,et al.  Learning sign language by watching TV (using weakly aligned subtitles) , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Bernt Schiele,et al.  In Good Shape: Robust People Detection based on Appearance and Shape , 2011, BMVC.

[44]  Ronald Poppe,et al.  Vision-based human motion analysis: An overview , 2007, Comput. Vis. Image Underst..

[45]  Yann LeCun,et al.  What is the best multi-stage architecture for object recognition? , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[46]  Hans-Peter Seidel,et al.  Fast articulated motion tracking using a sums of Gaussians body model , 2011, 2011 International Conference on Computer Vision.

[47]  Bernt Schiele,et al.  Articulated people detection and pose estimation: Reshaping the future , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Hans-Peter Seidel,et al.  A Statistical Model of Human Pose and Body Shape , 2009, Comput. Graph. Forum.