Monocular 3D object recognition

Object recognition is one of the fundamental tasks of computer vision. Recent advances in the field enable reliable 2D detections from a single cluttered image. However, many challenges still remain. Object detection needs timely response for real world applications. Moreover, we are genuinely interested in estimating the 3D pose and shape of an object or human for the sake of robotic manipulation and human-robot interaction. In this thesis, a suite of solutions to these challenges is presented. First, Active Deformable Part Models (ADPM) is proposed for fast part-based object detection. ADPM dramatically accelerates the detection by dynamically scheduling the part evaluations and efficiently pruning the image locations. Second, we unleash the power of marrying discriminative 2D parts with an explicit 3D geometric representation. Several methods of such scheme are proposed for recovering rich 3D information of both rigid and non-rigid objects from monocular RGB images. (1) The accurate 3D pose of an object instance is recovered from cluttered images using only the CAD model. (2) A global optimal solution for simultaneous 2D part localization, 3D pose and shape estimation is obtained by optimizing a unified convex objective function. Both appearance and geometric compatibility are jointly maximized. (3) 3D human pose estimation from an image sequence is realized via an Expectation-Maximization algorithm. The 2D joint location uncertainties are marginalized out during inference and 3D pose smoothness is enforced across frames. By bridging the gap between 2D and 3D, our methods provide an end-to-end solution to 3D object recognition from images. We demonstrate a range of interesting applications using only a single image or a monocular video, including autonomous robotic grasping with a single image, 3D object image pop-up and a monocular human MoCap system. We also show empirical start-of-art results on a number of benchmarks on 2D detection and 3D pose and shape estimation. Degree Type Dissertation Degree Name Doctor of Philosophy (PhD) Graduate Group Computer and Information Science First Advisor Kostas Daniilidis

[1]  Timothy F. Cootes,et al.  Feature Detection and Tracking with Constrained Local Models , 2006, BMVC.

[2]  Hsi-Jian Lee,et al.  Determination of 3D human body postures from a single view , 1985, Comput. Vis. Graph. Image Process..

[3]  Feng Han,et al.  Bayesian reconstruction of 3D shapes and scenes from a single image , 2003, First IEEE International Workshop on Higher-Level Knowledge in 3D Modeling and Motion Analysis, 2003. HLK 2003..

[4]  Francesc Moreno-Noguer,et al.  Efficient 3D Object Detection using Multiple Pose-Specific Classifiers , 2011, BMVC.

[5]  T. Kanade,et al.  Reconstructing 3D Human Pose from 2D Image Landmarks , 2012, ECCV.

[6]  Takeo Kanade,et al.  Trajectory Space: A Dual Representation for Nonrigid Structure from Motion , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Jitendra Malik,et al.  Discriminative Decorrelation for Clustering and Classification , 2012, ECCV.

[8]  Toby Sharp,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR.

[9]  Takeo Kanade,et al.  3D Alignment of Face in a Single Image , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[10]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Jordi Gonzàlez,et al.  A coarse-to-fine approach for fast deformable object detection , 2011, CVPR 2011.

[12]  Silvio Savarese,et al.  Estimating the aspect layout of object categories , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Ben Taskar,et al.  Cascaded Models for Articulated Pose Estimation , 2010, ECCV.

[14]  Donald Geman,et al.  Coarse-to-Fine Face Detection , 2004, International Journal of Computer Vision.

[15]  Jonathan Tompson,et al.  Learning Human Pose Estimation Features with Convolutional Networks , 2013, ICLR.

[16]  Silvio Savarese,et al.  Depth-Encoded Hough Voting for Joint Object Detection and Shape Recovery , 2010, ECCV.

[17]  David A. McAllester,et al.  Cascade object detection with deformable part models , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[19]  Peter V. Gehler,et al.  3D2PM - 3D Deformable Part Models , 2012, ECCV.

[20]  Paul J. Besl,et al.  A Method for Registration of 3-D Shapes , 1992, IEEE Trans. Pattern Anal. Mach. Intell..

[21]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Yoram Singer,et al.  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..

[23]  Fernando De la Torre,et al.  Spatio-temporal Matching for Human Detection in Video , 2014, ECCV.

[24]  George J. Pappas,et al.  Localization from semantic observations via the matrix permanent , 2016, Int. J. Robotics Res..

[25]  Jitendra Malik,et al.  Tracking people with twists and exponential maps , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[26]  Peter V. Gehler,et al.  Teaching 3D geometry to deformable part models , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  J. Orbach Principles of Neurodynamics. Perceptrons and the Theory of Brain Mechanisms. , 1962 .

[28]  Christoph H. Lampert,et al.  Beyond sliding windows: Object localization by efficient subwindow search , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Jitendra Malik,et al.  Training Deformable Part Models with Decorrelated Features , 2013, 2013 IEEE International Conference on Computer Vision.

[30]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[31]  Kristen Grauman,et al.  Efficient region search for object detection , 2011, CVPR 2011.

[32]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[33]  Cristian Sminchisescu,et al.  Kinematic jump processes for monocular 3D human tracking , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[34]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[35]  Sinisa Todorovic,et al.  From contours to 3D object detection and pose estimation , 2011, 2011 International Conference on Computer Vision.

[36]  Jonathan Brandt,et al.  Robust object detection via soft cascade , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[37]  Matthew B. Blaschko,et al.  Learning a category independent object detection cascade , 2011, 2011 International Conference on Computer Vision.

[38]  Sven J. Dickinson,et al.  3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model , 2012, NIPS.

[39]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Andrew W. Fitzgibbon,et al.  KinectFusion: real-time dynamic 3D surface reconstruction and interaction , 2011, SIGGRAPH '11.

[41]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[42]  Feng Wu,et al.  Efficient 2D-to-3D Correspondence Filtering for Scalable 3D Object Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Kurt Hornik,et al.  Multilayer feedforward networks are universal approximators , 1989, Neural Networks.

[45]  Cordelia Schmid,et al.  On Pencils of Tangent Planes and the Recognition of Smooth 3D Shapes from Silhouettes , 2002, ECCV.

[46]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[47]  Ben Taskar,et al.  Shape-Based Object Detection via Boundary Structure Segmentation , 2012, International Journal of Computer Vision.

[48]  Deva Ramanan,et al.  Face detection, pose estimation, and landmark localization in the wild , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Michael Isard,et al.  Loose-limbed People: Estimating 3D Human Pose and Motion Using Non-parametric Belief Propagation , 2011, International Journal of Computer Vision.

[50]  Bamdev Mishra,et al.  Manopt, a matlab toolbox for optimization on manifolds , 2013, J. Mach. Learn. Res..

[51]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[52]  Henning Biermann,et al.  Recovering non-rigid 3D shape from image streams , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[53]  David J. Fleet,et al.  Video-Based People Tracking , 2010, Handbook of Ambient Intelligence and Smart Environments.

[54]  Radu Horaud,et al.  New Methods for Matching 3-D Objects with Single Perspective Views , 1987, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  Jitendra Malik,et al.  Recovering 3D human body configurations using shape contexts , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Michael I. Jordan,et al.  On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes , 2001, NIPS.

[57]  Tom Goldstein,et al.  The Split Bregman Method for L1-Regularized Problems , 2009, SIAM J. Imaging Sci..

[58]  Stephen P. Boyd,et al.  Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers , 2011, Found. Trends Mach. Learn..

[59]  Dieter Fox,et al.  A Scalable Tree-Based Approach for Joint Object and Pose Recognition , 2011, AAAI.

[60]  Cordelia Schmid,et al.  Viewpoint-independent object class detection using 3D Feature Maps , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[61]  Bernt Schiele,et al.  Detailed 3D Representations for Object Recognition and Modeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[62]  George J. Pappas,et al.  Active Deformable Part Models Inference , 2014, ECCV.

[63]  Cordelia Schmid,et al.  Multi-view object class detection with a 3D geometric model , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[64]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[65]  Iasonas Kokkinos,et al.  Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound , 2011, NIPS.

[66]  Vincent Lepetit,et al.  Dominant orientation templates for real-time detection of texture-less objects , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[67]  João Paulo Costeira,et al.  A Global Solution to Sparse Correspondence Problems , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[68]  Piotr Dollár,et al.  Crosstalk Cascades for Frame-Rate Pedestrian Detection , 2012, ECCV.

[69]  Simon Lucey,et al.  Complex Non-rigid Motion 3D Reconstruction by Union of Subspaces , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[70]  Adrian Hilton,et al.  A survey of advances in vision-based human motion capture and analysis , 2006, Comput. Vis. Image Underst..

[71]  Gregory D. Hager,et al.  Fast and Globally Convergent Pose Estimation from Video Images , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[72]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[73]  J. Platt Sequential Minimal Optimization : A Fast Algorithm for Training Support Vector Machines , 1998 .

[74]  Tae-Kyun Kim,et al.  Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[75]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[76]  Yaser Sheikh,et al.  3D reconstruction of a smooth articulated trajectory from a monocular image sequence , 2011, 2011 International Conference on Computer Vision.

[77]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[78]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[79]  Larry S. Davis,et al.  Jointly Optimizing 3D Model Fitting and Fine-Grained Classification , 2014, ECCV.

[80]  Ben Taskar,et al.  Structured Prediction Cascades , 2010, AISTATS.

[81]  James M. Rehg,et al.  On the Design of Cascades of Boosted Ensembles for Face Detection , 2008, International Journal of Computer Vision.

[82]  Youjie Zhou,et al.  Pose Locality Constrained Representation for 3D Human Pose Reconstruction , 2014, ECCV.

[83]  Alan L. Yuille,et al.  Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[84]  Cristian Sminchisescu 3D Human Motion Analysis in Monocular Video Techniques and Challenges , 2006, AVSS.

[85]  Stella X. Yu,et al.  Linear Scale and Rotation Invariant Matching , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[86]  Charles R. Dyer,et al.  Visibility, occlusion, and the aspect graph , 1990, International Journal of Computer Vision.

[87]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[88]  Benjamin B. Kimia,et al.  3D Object Recognition Using Shape Similarity-Based Aspect Graph , 2001, ICCV.

[89]  Antoni B. Chan,et al.  3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[90]  Francesc Moreno-Noguer,et al.  A Joint Model for 2D and 3D Pose Estimation from a Single Image , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[91]  Xiaowei Zhou,et al.  Sparse Representation for 3D Shape Estimation: A Convex Relaxation Approach , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[92]  Christoph H. Lampert An efficient divide-and-conquer cascade for nonlinear object detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[93]  Wenze Hu,et al.  Learning 3D Object Templates by Quantizing Geometry and Appearance Spaces , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[94]  Weiyu Zhang,et al.  From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[95]  Camillo J. Taylor,et al.  Reconstruction of Articulated Objects from Point Correspondences in a Single Uncalibrated Image , 2000, Comput. Vis. Image Underst..

[96]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[97]  Xiaowei Zhou,et al.  3D Shape Reconstruction from 2D Landmarks: A Convex Formulation , 2014, ArXiv.

[98]  Raquel Urtasun,et al.  Implicitly Constrained Gaussian Process Regression for Monocular Non-Rigid Pose Estimation , 2010, NIPS.

[99]  Tinne Tuytelaars,et al.  Is 2D Information Enough For Viewpoint Estimation? , 2014, BMVC.

[100]  Simon Lucey,et al.  Deterministic 3D Human Pose Estimation Using Rigid Structure , 2010, ECCV.

[101]  Trevor Darrell,et al.  Fast pose estimation with parameter-sensitive hashing , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[102]  V. Vapnik Estimation of Dependences Based on Empirical Data , 2006 .

[103]  Silvio Savarese,et al.  3D generic object categorization, localization and pose estimation , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[104]  Ronen Basri,et al.  Contour-based joint clustering of multiple segmentations , 2011, CVPR 2011.

[105]  Andrew Zisserman,et al.  Flowing ConvNets for Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[106]  Gary R. Bradski,et al.  Fast 3D recognition and pose using the Viewpoint Feature Histogram , 2010, 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[107]  Pascal Fua,et al.  Fast Object Detection with Entropy-Driven Evaluation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[108]  Xiaofeng Ren,et al.  Discriminative Mixture-of-Templates for Viewpoint Classification , 2010, ECCV.

[109]  Bernt Schiele,et al.  Monocular 3D pose estimation and tracking by detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[110]  Radu Bogdan Rusu,et al.  3D is here: Point Cloud Library (PCL) , 2011, 2011 IEEE International Conference on Robotics and Automation.

[111]  Mubarak Shah,et al.  3D Model based Object Class Detection in An Arbitrary View , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[112]  Ronen Basri,et al.  Recognition by Linear Combinations of Models , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[113]  Vincent Lepetit,et al.  Predicting People's 3D Poses from Short Sequences , 2015, ArXiv.

[114]  Jonathan Warrell,et al.  Proposal generation for object detection using cascaded ranking SVMs , 2011, CVPR 2011.

[115]  Junzhou Huang,et al.  Optimal object matching via convexification and composition , 2011, 2011 International Conference on Computer Vision.

[116]  Luc Van Gool,et al.  Branch&Rank: Non-Linear Object Detection , 2011, BMVC.

[117]  Ian D. Reid,et al.  Simultaneous Monocular 2D Segmentation, 3D Pose Recovery and 3D Reconstruction , 2012, ACCV.

[118]  Song-Chun Zhu,et al.  Joint action recognition and pose estimation from video , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[119]  Alvaro Collet,et al.  Making specific features less discriminative to improve point-based 3D object recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[120]  Ankur Agarwal,et al.  Recovering 3D human pose from monocular images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[121]  Hongdong Li,et al.  A simple prior-free method for non-rigid structure-from-motion factorization , 2012, CVPR.

[122]  William Grimson,et al.  Object recognition by computer - the role of geometric constraints , 1991 .

[123]  Ronen Basri Recognition by prototypes , 2004, International Journal of Computer Vision.

[124]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[125]  David J. Kriegman,et al.  Curve and Surface Duals and the Recognition of Curved 3D Objects from their Silhouettes , 2004, International Journal of Computer Vision.

[126]  Gerd Häusler,et al.  Feature-Based Object Recognition and Localization in 3D-Space, Using a Single Video Image , 1999, Comput. Vis. Image Underst..

[127]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[128]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[129]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[130]  John F. Canny,et al.  A Computational Approach to Edge Detection , 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[131]  Hao Jiang 3D Human Pose Reconstruction Using Millions of Exemplars , 2010, 2010 20th International Conference on Pattern Recognition.

[132]  Luc Van Gool,et al.  Fast PRISM: Branch and Bound Hough Transform for Object Class Detection , 2011, International Journal of Computer Vision.

[133]  Timothy F. Cootes,et al.  Active Shape Models-Their Training and Application , 1995, Comput. Vis. Image Underst..

[134]  Trevor Darrell,et al.  Gaussian Processes for Object Categorization , 2010, International Journal of Computer Vision.

[135]  Songhwai Oh,et al.  Complex Non-rigid 3D Shape Recovery Using a Procrustean Normal Distribution Mixture Model , 2015, International Journal of Computer Vision.

[136]  Thorsten Joachims,et al.  Semantic Labeling of 3D Point Clouds for Indoor Scenes , 2011, NIPS.

[137]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[138]  Xiaowei Zhou,et al.  Single Image Pop-Up from Discriminatively Learned Parts , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[139]  Kostas Daniilidis,et al.  Single image 3D object detection and pose estimation for grasping , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[140]  Antonio Torralba,et al.  Parsing IKEA Objects: Fine Pose Estimation , 2013, 2013 IEEE International Conference on Computer Vision.

[141]  Deva Ramanan,et al.  Analyzing 3D Objects in Cluttered Images , 2012, NIPS.

[142]  Rita Cucchiara,et al.  Multistage Particle Windows for Fast and Accurate Object Detection , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[143]  Michael J. Black,et al.  Pose-conditioned joint angle limits for 3D human pose reconstruction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[144]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[145]  Y. Nesterov Gradient methods for minimizing composite objective function , 2007 .

[146]  Antoni B. Chan,et al.  Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[147]  Wen Gao,et al.  Robust Estimation of 3D Human Poses from a Single Image , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[148]  Ronen Basri,et al.  Viewpoint-aware object detection and pose estimation , 2011, 2011 International Conference on Computer Vision.