Single Image Depth Prediction Made Better: A Multivariate Gaussian Take

Neural-network-based single image depth prediction (SIDP) is a challenging task in which the goal is to predict a scene's per-pixel depth at test time. Since the problem is ill-posed by definition, the fundamental goal is to devise an approach that can reliably model scene depth from a set of training examples. In pursuit of perfect depth estimation, most existing state-of-the-art learning techniques predict a single scalar depth value per pixel. Yet it is well known that a trained model has accuracy limits and can predict imprecise depth. An SIDP approach must therefore account for the expected variation in the model's depth predictions at test time. Accordingly, we introduce an approach that performs continuous modeling of per-pixel depth, so that we can predict and reason about both the per-pixel depth and its distribution. To this end, we model per-pixel scene depth using a multivariate Gaussian distribution. Moreover, in contrast to existing uncertainty modeling methods in the same spirit, which assume per-pixel depth to be independent, we introduce per-pixel covariance modeling that encodes each pixel's depth dependency with respect to all the scene points. Unfortunately, per-pixel depth covariance modeling leads to a computationally expensive continuous loss function, which we compute efficiently using a learned low-rank approximation of the overall covariance matrix. Notably, when tested on benchmark datasets such as KITTI, NYU, and SUN RGB-D, the SIDP model obtained by optimizing our loss function shows state-of-the-art results. Our method (named MG) ranks among the top entries on the KITTI depth-prediction benchmark leaderboard.
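
To illustrate why a low-rank covariance makes such a loss tractable, the following is a minimal sketch, not the paper's implementation. It assumes the covariance takes the form diag(d) + U Uᵀ, with a positive diagonal d and an N × r factor U; the abstract states only that a learned low-rank approximation is used, so this parameterization and the names `low_rank_gaussian_nll`, `dense_gaussian_nll`, `diag`, and `factor` are all hypothetical. Under that assumption, the Woodbury identity and the matrix determinant lemma give the Gaussian negative log-likelihood in roughly O(N r²) operations instead of the O(N³) needed for a dense N × N covariance.

```python
import numpy as np


def low_rank_gaussian_nll(depth, mean, diag, factor):
    """Negative log-likelihood of N(mean, diag(diag) + factor @ factor.T).

    Assumed (hypothetical) parameterization: `diag` is a positive vector of
    length N and `factor` is an N x r matrix with r << N. The N x N
    covariance is never formed.
    """
    n, r = factor.shape
    resid = depth - mean

    # D^{-1} resid and D^{-1} U, computed element-wise.
    dinv_resid = resid / diag
    dinv_factor = factor / diag[:, None]

    # Small r x r matrix: I + U^T D^{-1} U (symmetric positive definite).
    cap = np.eye(r) + factor.T @ dinv_factor
    chol = np.linalg.cholesky(cap)

    # Woodbury identity for the quadratic form resid^T Sigma^{-1} resid.
    w = np.linalg.solve(chol, factor.T @ dinv_resid)
    maha = resid @ dinv_resid - w @ w

    # Matrix determinant lemma: log|Sigma| = log|cap| + sum(log d).
    logdet = 2.0 * np.sum(np.log(np.diag(chol))) + np.sum(np.log(diag))

    return 0.5 * (maha + logdet + n * np.log(2.0 * np.pi))


def dense_gaussian_nll(depth, mean, diag, factor):
    """Reference version that builds the full covariance (O(N^3))."""
    cov = np.diag(diag) + factor @ factor.T
    resid = depth - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = resid @ np.linalg.solve(cov, resid)
    return 0.5 * (maha + logdet + resid.size * np.log(2.0 * np.pi))


# Toy check on synthetic data: 200 "pixels", rank-8 factor.
rng = np.random.default_rng(0)
num_pixels, rank = 200, 8
mean = rng.uniform(1.0, 10.0, size=num_pixels)
diag = rng.uniform(0.1, 1.0, size=num_pixels)
factor = 0.3 * rng.standard_normal((num_pixels, rank))
depth = mean + rng.standard_normal(num_pixels)

fast = low_rank_gaussian_nll(depth, mean, diag, factor)
slow = dense_gaussian_nll(depth, mean, diag, factor)
print(np.isclose(fast, slow))
```

In an actual SIDP model the mean, diagonal, and factor would be network outputs and the pixel count would be the full image resolution, which is where avoiding the dense covariance matters.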
