Single Image 3D without a Single 3D Image

Do we really need 3D labels in order to learn how to predict 3D? In this paper, we show that one can learn a mapping from appearance to 3D properties without ever seeing a single explicit 3D label. Rather than use explicit supervision, we use the regularity of indoor scenes to learn the mapping in a completely unsupervised manner. We demonstrate this on both a standard 3D scene understanding dataset as well as Internet images for which 3D is unavailable, precluding supervised learning. Despite never seeing a 3D label, our method produces competitive results.

[1]  Marc Pollefeys,et al.  Pulling Things out of Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  D. A. Huffman,et al.  Impossible Objects as Nonsense Sentences , 2012 .

[3]  Jitendra Malik,et al.  Computing Local Surface Orientation and Shape from Texture for Curved Surfaces , 1997, International Journal of Computer Vision.

[4]  Jitendra Malik,et al.  Discriminative Decorrelation for Clustering and Classification , 2012, ECCV.

[5]  C. V. Jawahar,et al.  Blocks That Shout: Distinctive Parts for Scene Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Jan-Michael Frahm,et al.  3D model matching with Viewpoint-Invariant Patches (VIP) , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Ashutosh Saxena,et al.  Learning Depth from Single Monocular Images , 2005, NIPS.

[8]  S. Sutherland Seeing things , 1989, Nature.

[9]  David A. Forsyth,et al.  Thinking Inside the Box: Using Appearance Models and Context Based on Room Geometry , 2010, ECCV.

[10]  Silvio Savarese,et al.  Layout Estimation of Highly Cluttered Indoor Scenes Using Geometric and Semantic Cues , 2013, ICIAP.

[11]  Derek Hoiem,et al.  Recovering the spatial layout of cluttered rooms , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[12]  Alexei A. Efros,et al.  Geometric context from a single image , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[13]  Song-Chun Zhu,et al.  Scene Parsing by Integrating Function, Geometry and Appearance Models , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Huamin Wang,et al.  Factoring repeated content within and among images , 2008, ACM Trans. Graph..

[15]  Abhinav Gupta,et al.  Designing deep networks for surface normal estimation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Derek Hoiem,et al.  Support Surface Prediction in Indoor Scenes , 2013, 2013 IEEE International Conference on Computer Vision.

[17]  Yong Jae Lee,et al.  Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time , 2013, 2013 IEEE International Conference on Computer Vision.

[18]  T. Kanade,et al.  Geometric reasoning for single image structure recovery , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[20]  Alan L. Yuille,et al.  The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference , 2000, NIPS.

[21]  David A. Forsyth,et al.  Shape from Texture without Boundaries , 2002, International Journal of Computer Vision.

[22]  Yi Ma,et al.  TILT: Transform Invariant Low-Rank Textures , 2010, ACCV 2010.

[23]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[24]  Daniel P. Huttenlocher,et al.  Efficient Graph-Based Image Segmentation , 2004, International Journal of Computer Vision.

[25]  Marc Pollefeys,et al.  Match Box: Indoor Image Matching via Box-Like Scene Estimation , 2014, 2014 2nd International Conference on 3D Vision.

[26]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[27]  Silvio Savarese,et al.  Understanding Indoor Scenes Using 3D Geometric Phrases , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Takeo Kanade,et al.  Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces , 2010, NIPS.

[29]  Martial Hebert,et al.  Unfolding an Indoor Origami World , 2014, ECCV.

[30]  Marc Pollefeys,et al.  Discriminatively Trained Dense Surface Normal Estimation , 2014, ECCV.

[31]  Andrew Owens,et al.  Shape Anchors for Data-Driven Multi-view Reconstruction , 2013, 2013 IEEE International Conference on Computer Vision.

[32]  Marc Pollefeys,et al.  Learning a Confidence Measure for Optical Flow , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[34]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[35]  Silvio Savarese,et al.  Understanding the 3D layout of a cluttered room from multiple images , 2014, IEEE Winter Conference on Applications of Computer Vision.

[36]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[37]  Rodney A. Brooks,et al.  The ACRONYM Model-Based Vision System , 1979, IJCAI.

[38]  Martial Hebert,et al.  Data-Driven 3D Primitives for Single Image Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[39]  Sven J. Dickinson,et al.  3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model , 2012, NIPS.

[40]  Alexei A. Efros,et al.  What makes Paris look like Paris? , 2015, Commun. ACM.

[41]  Raquel Urtasun,et al.  Efficient Exact Inference for 3D Indoor Scene Understanding , 2012, ECCV.

[42]  Alexei A. Efros,et al.  Mid-level Visual Element Discovery as Discriminative Mode Seeking , 2013, NIPS.