Make3D: Learning 3D Scene Structure from a Single Still Image

We consider the problem of estimating detailed 3D structure from a single still image of an unstructured environment. Our goal is to create 3D models that are both quantitatively accurate and visually pleasing. For each small homogeneous patch in the image, we use a Markov random field (MRF) to infer a set of "plane parameters" that capture both the 3D location and the 3D orientation of the patch. The MRF, trained via supervised learning, models both image depth cues and the relationships between different parts of the image. Inference in our model is tractable and requires only solving a convex optimization problem. Other than assuming that the environment is made up of a number of small planes, our model makes no explicit assumptions about the structure of the scene; this enables the algorithm to capture much more detailed 3D structure than prior art (such as Saxena et al., 2005; Delage et al., 2005; and Hoiem et al., 2005), and also to give a much richer experience in the 3D flythroughs created using image-based rendering, even for scenes with significant non-vertical structure. Using this approach, we have created qualitatively correct 3D models for 64.9% of 588 images downloaded from the Internet, compared with 33.1% for Hoiem et al. Further, our models are quantitatively more accurate than those of either Saxena et al. or Hoiem et al.
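To make the idea of "plane parameters" concrete, here is a minimal sketch in Python. The abstract does not spell out the parameterization, so the code assumes one common convention: a vector alpha in R^3 such that every point q on the plane satisfies alpha · q = 1, with the camera at the origin. Under that assumption, alpha / |alpha| is the plane normal (orientation), 1 / |alpha| is the distance from the camera to the plane (location), and the depth along a unit viewing ray r is 1 / (r · alpha). All function names are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(dot(v, v))
    return tuple(c / n for c in v)


def plane_normal(alpha):
    """Orientation of the plane: the unit normal alpha / |alpha|."""
    return unit(alpha)


def plane_distance(alpha):
    """Location of the plane: distance from the camera origin, 1 / |alpha|."""
    return 1.0 / math.sqrt(dot(alpha, alpha))


def depth_along_ray(alpha, ray):
    """Depth d at which the ray from the origin meets the plane.

    Assumed convention: points q on the plane satisfy alpha . q = 1,
    so q = d * r with r the unit ray gives d = 1 / (r . alpha).
    """
    r = unit(ray)
    denom = dot(r, alpha)
    if denom <= 0.0:
        raise ValueError("ray does not hit the plane in front of the camera")
    return 1.0 / denom


def point_on_plane(alpha, ray):
    """3D point where the viewing ray intersects the plane."""
    d = depth_along_ray(alpha, ray)
    return tuple(d * c for c in unit(ray))


# A fronto-parallel wall 5 units in front of the camera (z = 5).
alpha = (0.0, 0.0, 0.2)
center_depth = depth_along_ray(alpha, (0.0, 0.0, 1.0))
oblique_point = point_on_plane(alpha, (3.0, 0.0, 4.0))
```

A single 3-vector per patch therefore encodes both where the patch is and how it is tilted, which is why one set of parameters suffices to place the patch in 3D under this convention.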

[1]  Tony Lindeberg,et al.  Shape from texture from a multi-scale perspective , 1993, 1993 (4th) International Conference on Computer Vision.

[2]  Reinhard Koch,et al.  Multi Viewpoint Stereo from Uncalibrated Video Sequences , 1998, ECCV.

[3]  Atsuto Maki,et al.  Geotensity: combining motion and lighting for 3D surface reconstruction , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[4]  Ping-Sing Tsai,et al.  Shape from Shading: A Survey , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  D. Scharstein,et al.  A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms , 2001, Proceedings IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001).

[6]  J. Loomis Looking down is looking up , 2001, Nature.

[7]  Masaaki Ikehara,et al.  HMM-based surface reconstruction from single images , 2002, Proceedings. International Conference on Image Processing.

[8]  Jean Ponce,et al.  Computer Vision: A Modern Approach , 2002 .

[9]  Antonio Torralba,et al.  Depth Estimation from Image Structure , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[10]  Antonio Torralba,et al.  Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes , 2003, NIPS.

[11]  Feng Han,et al.  Bayesian reconstruction of 3D shapes and scenes from a single image , 2003, First IEEE International Workshop on Higher-Level Knowledge in 3D Modeling and Motion Analysis, 2003. HLK 2003..

[12]  Ze-Nian Li,et al.  A survey of motion-parallax-based 3-D reconstruction algorithms , 2004, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[13]  Jitendra Malik,et al.  Learning to detect natural image boundaries using local brightness, color, and texture cues , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Hiroshi Kikuchi,et al.  Real-time three-dimensional video image composition by depth information , 2004, IEICE Electron. Express.

[15]  Jitendra Malik,et al.  Computing Local Surface Orientation and Shape from Texture for Curved Surfaces , 1997, International Journal of Computer Vision.

[16]  Reinhard Koch,et al.  Visual Modeling with a Hand-Held Camera , 2004, International Journal of Computer Vision.

[17]  Ian D. Reid,et al.  Single View Metrology , 2000, International Journal of Computer Vision.

[18]  Antonio Torralba,et al.  Contextual Priming for Object Detection , 2003, International Journal of Computer Vision.

[19]  Richard Szeliski,et al.  A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms , 2001, International Journal of Computer Vision.

[20]  Ashutosh Saxena,et al.  High speed obstacle avoidance using monocular vision and reinforcement learning , 2005, ICML.

[21]  Honglak Lee,et al.  Automatic Single-Image 3d Reconstructions of Indoor Manhattan World Scenes , 2007, ISRR.

[22]  Alexei A. Efros,et al.  Automatic photo pop-up , 2005, ACM Trans. Graph..

[23]  Alexei A. Efros,et al.  Geometric context from a single image , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[24]  Ashutosh Saxena,et al.  Learning Depth from Single Monocular Images , 2005, NIPS.

[25]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[26]  Christopher Joseph Pal,et al.  Multi-Conditional Learning: Generative/Discriminative Training for Clustering and Classification , 2006, AAAI.

[27]  Honglak Lee,et al.  A Dynamic Bayesian Network Model for Autonomous 3D Reconstruction from a Single Indoor Image , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[28]  Antonio Torralba,et al.  Depth from Familiar Objects: A Hierarchical Model for 3D Scenes , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[29]  Alexei A. Efros,et al.  Putting Objects in Perspective , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[30]  Luc Van Gool,et al.  SURF: Speeded Up Robust Features , 2006, ECCV.

[31]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[32]  Ronen Basri,et al.  Example Based 3D Reconstruction from Single 2D Images , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[33]  Ashutosh Saxena,et al.  Robotic Grasping of Novel Objects , 2006, NIPS.

[34]  Steven M. Seitz,et al.  Photo tourism: exploring photo collections in 3D , 2006, ACM Trans. Graph..

[35]  Andrew McCallum,et al.  Multi-Conditional Learning for Joint Probability Models with Latent Variables , 2006 .

[36]  Ashutosh Saxena,et al.  3-D Depth Reconstruction from a Single Still Image , 2007, International Journal of Computer Vision.

[37]  Ashutosh Saxena,et al.  Depth Estimation Using Monocular and Stereo Cues , 2007, IJCAI.

[38]  Ashutosh Saxena,et al.  Learning 3-D Scene Structure from a Single Still Image , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[39]  Nasser M. Nasrabadi,et al.  Pattern Recognition and Machine Learning , 2006, Technometrics.

[40]  Bernd Freisleben,et al.  Using depth features to retrieve monocular video shots , 2007, CIVR '07.

[42]  Ashutosh Saxena,et al.  3-D Reconstruction from Sparse Views using Monocular Vision , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[43]  Ashutosh Saxena,et al.  Make3D: Depth Perception from a Single Still Image , 2008, AAAI.

[44]  Christopher Hunt,et al.  Notes on the OpenSURF Library , 2009 .