Learning to Recover 3D Scene Shape from a Single Image

Despite significant progress in monocular depth estimation in the wild, recent state-of-the-art methods cannot recover accurate 3D scene shape, for two reasons: the shift-invariant reconstruction losses used in mixed-data depth prediction training leave an unknown depth shift, and the camera focal length may be unknown. We investigate this problem in detail and propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then uses 3D point cloud encoders to predict the missing depth shift and focal length, which allows us to recover a realistic 3D scene shape. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to enhance depth prediction models trained on mixed datasets. We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot dataset generalization. Code is available at: this https URL
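To illustrate why an unknown depth shift matters more than an unknown scale, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a pinhole camera with the principal point at the image centre and square pixels; the function names `unproject` and `planarity_residual`, and all numeric values, are hypothetical choices for this example. A slanted plane is back-projected once with its true depth, once with scaled depth, and once with shifted depth.

```python
import numpy as np

def unproject(depth, focal, shift=0.0):
    """Back-project a depth map to an (H*W, 3) point cloud.

    Assumes a pinhole camera, principal point at the image centre,
    and square pixels (illustrative assumptions, not from the paper).
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    d = depth + shift                      # apply the depth shift
    x = (u - cx) * d / focal
    y = (v - cy) * d / focal
    return np.stack([x, y, d], axis=-1).reshape(-1, 3)

def planarity_residual(points):
    """RMS distance of the points to their best-fit plane."""
    centered = points - points.mean(axis=0)
    smallest = np.linalg.svd(centered, compute_uv=False)[-1]
    return smallest / np.sqrt(len(points))

# Synthetic depth map of the slanted plane a*X + Z = c.
h, w, focal = 48, 64, 50.0
a, c = 0.5, 2.0
_, u = np.mgrid[0:h, 0:w].astype(np.float64)
depth = c / (1.0 + a * (u - (w - 1) / 2.0) / focal)

true_points = unproject(depth, focal)
scaled_points = unproject(2.0 * depth, focal)        # unknown scale
shifted_points = unproject(depth, focal, shift=1.0)  # unknown shift

res_true = planarity_residual(true_points)
res_scaled = planarity_residual(scaled_points)
res_shifted = planarity_residual(shifted_points)
print(res_true, res_scaled, res_shifted)
```

In this sketch the scaled cloud is a uniformly enlarged copy of the true one and stays planar, whereas the shifted cloud bends away from a plane. A wrong focal length would instead stretch the cloud along the X and Y axes relative to Z, which changes angles in the scene.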
