Object detection, shape recovery, and 3D modelling by depth-encoded Hough voting

Detecting objects, estimating their pose, and recovering their 3D shape are critical problems in many vision and robotics applications. This paper addresses these needs with a two-stage approach. In the first stage, we propose a new method called DEHV (Depth-Encoded Hough Voting). DEHV jointly detects objects, infers their categories, estimates their pose, and infers (decodes) object depth maps from either a single image (when no depth map is available at test time) or a single image augmented with a depth map (when one is available at test time). Inspired by the Hough voting scheme introduced in [1], DEHV incorporates depth information into the process of learning distributions of image features (patches) representing an object category. DEHV takes advantage of the interplay between the scale of each object patch in the image and its distance (depth) from the corresponding physical patch attached to the 3D object. Once the depth map is available, a full reconstruction is achieved in the second (3D modelling) stage, where modified or state-of-the-art 3D shape and texture completion techniques are used to recover the complete 3D model. Extensive quantitative and qualitative experimental analysis on existing datasets [2-4] and a newly proposed 3D table-top object category dataset shows that our DEHV scheme obtains competitive detection and pose estimation results. Finally, the quality of 3D modelling, in terms of both shape completion and texture completion, is evaluated on a 3D modelling dataset containing both indoor and outdoor object categories. We demonstrate that our overall algorithm can obtain convincing 3D shape reconstruction from a single uncalibrated image.
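The scale-depth interplay mentioned in the abstract can be illustrated with a toy sketch. It is not the DEHV formulation itself, which is probabilistic and learned from training data. It assumes a simple pinhole camera, under which the image size of a patch is inversely proportional to its depth, and uses a plain vote accumulator. All names (`patch_scale`, `cast_votes`, `best_centre`), the unit-depth offsets, and the bin size are hypothetical choices made for illustration only.

```python
from collections import Counter


def patch_scale(depth, physical_size=1.0, focal_length=1.0):
    """Assumed pinhole relation: image size of a patch shrinks as 1/depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return focal_length * physical_size / depth


def cast_votes(patches, bin_size=4.0):
    """Accumulate Hough votes for the object centre.

    Each patch is (x, y, depth, (dx, dy)). The offset (dx, dy) is a
    hypothetical learned displacement from the patch to the object centre,
    expressed at unit depth; it is rescaled by the depth-derived scale.
    """
    accumulator = Counter()
    for x, y, depth, (dx, dy) in patches:
        s = patch_scale(depth)
        cx, cy = x + s * dx, y + s * dy
        # Quantise the voted centre into a coarse grid cell.
        accumulator[(round(cx / bin_size), round(cy / bin_size))] += 1
    return accumulator


def best_centre(accumulator, bin_size=4.0):
    """Return the centre of the most-voted grid cell and its vote count."""
    (bx, by), votes = accumulator.most_common(1)[0]
    return (bx * bin_size, by * bin_size), votes


# Three patches on an object at depth 2 plus one background patch.
patches = [
    (80.0, 100.0, 2.0, (40.0, 0.0)),
    (100.0, 90.0, 2.0, (0.0, 20.0)),
    (110.0, 110.0, 2.0, (-20.0, -20.0)),
    (10.0, 10.0, 1.0, (5.0, 5.0)),
]
centre, votes = best_centre(cast_votes(patches))
print(centre, votes)
```

In this sketch the three object patches agree on one centre because each offset is rescaled by the depth of its own patch, while the background patch votes elsewhere. When no depth map is available, the abstract indicates that depth is inferred rather than given; that step is not modelled here.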

[1]  Leonard McMillan,et al.  Plenoptic modeling: an image-based rendering system , 1995, SIGGRAPH.

[2]  Reinhard Koch,et al.  Visual Modeling with a Hand-Held Camera , 2004, International Journal of Computer Vision.

[3]  Silvio Savarese,et al.  3D generic object categorization, localization and pose estimation , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[4]  Cordelia Schmid,et al.  3D object modeling and recognition using affine-invariant patches and multi-view spatial constraints , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings.

[5]  Daniel Cremers,et al.  Non-parametric Single View Reconstruction of Curved Objects Using Convex Optimization , 2009, DAGM-Symposium.

[6]  Richard Szeliski,et al.  High-quality video view interpolation using a layered representation , 2004, SIGGRAPH 2004.

[7]  Luc Van Gool,et al.  Depth-From-Recognition: Inferring Meta-data by Cognitive Feedback , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[8]  John F. Hughes,et al.  SmoothSketch: 3D free-form shapes from complex sketches , 2006, SIGGRAPH '06.

[9]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[10]  Frédéric Jurie,et al.  Groups of Adjacent Contour Segments for Object Detection , 2008, IEEE Trans. Pattern Anal. Mach. Intell.

[11]  B. Schiele,et al.  Combined Object Categorization and Segmentation With an Implicit Shape Model , 2004.

[12]  Dana H. Ballard,et al.  Generalizing the Hough transform to detect arbitrary shapes , 1981, Pattern Recognit.

[13]  Thomas A. Funkhouser,et al.  The Princeton Shape Benchmark (Figures 1 and 2) , 2004, Shape Modeling International Conference.

[14]  Richard Szeliski,et al.  Building Rome in a day , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[15]  T. Kanade,et al.  Geometric reasoning for single image structure recovery , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  David G. Lowe,et al.  Local feature view clustering for 3D object recognition , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[17]  Michael Bosse,et al.  Calibrated, Registered Images of an Extended Urban Area , 2003, International Journal of Computer Vision.

[18]  Dani Lischinski,et al.  Deep photo: model-based photograph enhancement and viewing , 2008, SIGGRAPH Asia '08.

[19]  Ronen Basri,et al.  Constructing implicit 3D shape models for pose estimation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[20]  Antonio Criminisi,et al.  Creating Architectural Models from Images , 1999, Comput. Graph. Forum.

[21]  Jitendra Malik,et al.  Modeling and Rendering Architecture from Photographs: A hybrid geometry- and image-based approach , 1996, SIGGRAPH.

[22]  Shimon Ullman,et al.  Recognizing solid objects by alignment with an image , 1990, International Journal of Computer Vision.

[23]  Marc Levoy,et al.  Real-time 3D model acquisition , 2002, ACM Trans. Graph.

[24]  Alberto Del Bimbo,et al.  Metric 3D reconstruction and texture acquisition of surfaces of revolution from a single uncalibrated view , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Luc Van Gool,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[26]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[27]  Stephen Gould,et al.  Discriminative learning with latent variables for cluttered indoor scene understanding , 2010, CACM.

[28]  Leonidas J. Guibas,et al.  Example-Based 3D Scan Completion , 2005.

[29]  Andrew W. Fitzgibbon,et al.  Single View Reconstruction of Curved Surfaces , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[30]  Sung Yong Shin,et al.  On pixel-based texture synthesis by non-parametric sampling , 2006, Comput. Graph.

[31]  Thomas A. Funkhouser,et al.  The Princeton Shape Benchmark , 2004, Proceedings Shape Modeling Applications, 2004.

[32]  Patrick Pérez,et al.  Poisson image editing , 2003, ACM Trans. Graph.

[33]  Daniel G. Aliaga,et al.  Sea of images , 2002, IEEE Visualization, 2002. VIS 2002.

[34]  Siddhartha S. Srinivasa,et al.  Object recognition and full pose registration from a single image for robotic manipulation , 2009, 2009 IEEE International Conference on Robotics and Automation.

[35]  Derek Hoiem,et al.  3D LayoutCRF for Multi-View Object Class Recognition and Segmentation , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Marc Levoy,et al.  The digital Michelangelo project: 3D scanning of large statues , 2000, SIGGRAPH.

[37]  Silvio Savarese,et al.  Representations and Techniques for 3D Object Recognition and Scene Interpretation , 2011, Representations and Techniques for 3D Object Recognition and Scene Interpretation.

[38]  Cordelia Schmid,et al.  Viewpoint-independent object class detection using 3D Feature Maps , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Silvio Savarese,et al.  Depth-Encoded Hough Voting for Joint Object Detection and Shape Recovery , 2010, ECCV.

[40]  Alexei A. Efros,et al.  Automatic photo pop-up , 2005, ACM Trans. Graph.

[41]  Andrew Blake,et al.  "GrabCut": interactive foreground extraction using iterated graph cuts , 2004, ACM Trans. Graph.

[42]  Silvio Savarese,et al.  View Synthesis for Recognizing Unseen Poses of Object Classes , 2008, ECCV.

[43]  Roberto Cipolla,et al.  Modelling and Interpretation of Architecture from Several Images , 2004, International Journal of Computer Vision.

[44]  Andrew W. Fitzgibbon,et al.  Finding nemo: Deformable object class modelling using curve matching , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[45]  Ankur Agarwal,et al.  Incorporating On-demand Stereo for Real Time Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Nico Blodow,et al.  Close-range scene segmentation and reconstruction of 3D point cloud maps for mobile manipulation in domestic environments , 2009, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[47]  Patrick Pérez,et al.  Object removal by exemplar-based inpainting , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings.

[48]  Pietro Perona,et al.  Visual navigation using a single camera , 1995, Proceedings of IEEE International Conference on Computer Vision.

[49]  Ali Farhadi,et al.  A latent model of discriminative aspect , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[50]  Marc Levoy,et al.  Light field rendering , 1996, SIGGRAPH.

[51]  Alexei A. Efros,et al.  Scene completion using millions of photographs , 2007, SIGGRAPH 2007.

[52]  Richard Szeliski,et al.  A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[53]  Ariel Shamir,et al.  Seam carving for media retargeting , 2009, CACM.

[54]  Harry Shum,et al.  Sketching reality: Realistic interpretation of architectural designs , 2008, TOGS.

[55]  Ken-ichi Anjyo,et al.  Tour into the picture: using a spidery mesh interface to make animation from a single image , 1997, SIGGRAPH.

[56]  A. Laurentini,et al.  The Visual Hull Concept for Silhouette-Based Image Understanding , 1994, IEEE Trans. Pattern Anal. Mach. Intell.

[57]  Loong Fah Cheong,et al.  Symmetric architecture modeling with a single image , 2009, ACM Trans. Graph.

[58]  Kiriakos N. Kutulakos,et al.  A Theory of Shape by Space Carving , 2000, International Journal of Computer Vision.

[59]  Silvio Savarese,et al.  Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[60]  David A. Forsyth,et al.  Thinking Inside the Box: Using Appearance Models and Context Based on Room Geometry , 2010, ECCV.

[61]  Paulo R. S. Mendonça,et al.  Camera Pose Estimation and Reconstruction from Image Profiles under Circular Motion , 2000, ECCV.

[62]  Pietro Perona,et al.  A sparse object category model for efficient learning and exhaustive recognition , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[63]  Juergen Gall,et al.  Class-specific Hough forests for object detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  Luc Van Gool,et al.  Using Multi-view Recognition and Meta-data Annotation to Guide a Robot's Attention , 2009, Int. J. Robotics Res.

[65]  Pietro Perona,et al.  3D Reconstruction by Shadow Carving: Theory and Practical Evaluation , 2007, International Journal of Computer Vision.

[66]  Jitendra Malik,et al.  Multi-scale object detection by clustering lines , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[67]  Mubarak Shah,et al.  3D Model based Object Class Detection in An Arbitrary View , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[68]  Marc Pollefeys,et al.  Efficient structured prediction for 3D indoor scene understanding , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[69]  Jitendra Malik,et al.  Object detection using a max-margin Hough transform , 2009, CVPR.

[70]  Takeo Kanade,et al.  A statistical approach to 3d object detection applied to faces and cars , 2000.

[71]  Sylvain Paris,et al.  Error-Tolerant Image Compositing , 2010, ECCV.

[72]  Sylvain Paris,et al.  Error-Tolerant Image Compositing , 2010, International Journal of Computer Vision.

[73]  Bobby Bodenheimer,et al.  Synthesis and evaluation of linear motion transitions , 2008, TOGS.

[74]  Steven M. Seitz,et al.  Photo tourism: exploring photo collections in 3D , 2006, ACM Trans. Graph.