Deep Learning-Based Monocular Depth Estimation Methods—A State-of-the-Art Review

Monocular depth estimation from Red-Green-Blue (RGB) images is a well-studied ill-posed problem in computer vision which has been investigated intensively over the past decade using Deep Learning (DL) approaches. The recent approaches for monocular depth estimation mostly rely on Convolutional Neural Networks (CNN). Estimating depth from two-dimensional images plays an important role in various applications including scene reconstruction, 3D object-detection, robotics and autonomous driving. This survey provides a comprehensive overview of this research topic including the problem representation and a short description of traditional methods for depth estimation. Relevant datasets and 13 state-of-the-art deep learning-based approaches for monocular depth estimation are reviewed, evaluated and discussed. We conclude this paper with a perspective towards future research work requiring further investigation in monocular depth estimation challenges.

[1]  Hong Zhang,et al.  Lightweight Monocular Depth Estimation Model by Joint End-to-End Filter Pruning , 2019, 2019 IEEE International Conference on Image Processing (ICIP).

[2]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[3]  D. Scharstein,et al.  A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms , 2001, Proceedings IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001).

[4]  Chunhua Shen,et al.  Enforcing Geometric Constraints of Virtual Normal for Depth Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[5]  Peter Wonka,et al.  High Quality Monocular Depth Estimation via Transfer Learning , 2018, ArXiv.

[6]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[7]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[9]  拓海 杉山,et al.  “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告 , 2017 .

[10]  Thomas Brox,et al.  A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Il Hong Suh,et al.  From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation , 2019, ArXiv.

[13]  Janne Heikkilä,et al.  A four-step camera calibration procedure with implicit image correction , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[14]  Rares Ambrus,et al.  SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[15]  Ping Tan,et al.  BA-Net: Dense Bundle Adjustment Network , 2018, ICLR 2018.

[16]  Gabriel J. Brostow,et al.  Self-Supervised Monocular Depth Hints , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Rita Cucchiara,et al.  POSEidon: Face-from-Depth for Driver Pose Estimation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Fisher Yu,et al.  3D Reconstruction from Accidental Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Long Chen,et al.  Augmented Reality for Depth Cues in Monocular Minimally Invasive Surgery , 2017, ArXiv.

[22]  Dacheng Tao,et al.  Geometry-Aware Symmetric Domain Adaptation for Monocular Depth Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Frederick R. Forst,et al.  On robust estimation of the location parameter , 1980 .

[24]  日向 俊二 Kinect for Windowsアプリを作ろう , 2012 .

[25]  Dacheng Tao,et al.  Deep Ordinal Regression Network for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Jan-Michael Frahm,et al.  Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Yasin Almalioglu,et al.  GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[29]  Sertac Karaman,et al.  FastDepth: Fast Monocular Depth Estimation on Embedded Systems , 2019, 2019 International Conference on Robotics and Automation (ICRA).

[30]  S. Dwivedi,et al.  Obesity May Be Bad: Compressed Convolutional Networks for Biomedical Image Segmentation , 2020 .

[31]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[32]  Haitao Zhao,et al.  Attention-based context aggregation network for monocular depth estimation , 2019, International Journal of Machine Learning and Cybernetics.

[33]  Fabio Tozeto Ramos,et al.  Hilbert maps: Scalable continuous occupancy mapping with stochastic gradient descent , 2015, Robotics: Science and Systems.

[34]  Thomas Brox,et al.  Sparsity Invariant CNNs , 2017, 2017 International Conference on 3D Vision (3DV).

[35]  Rares Ambrus,et al.  3D Packing for Self-Supervised Monocular Depth Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Wei Liu,et al.  ParseNet: Looking Wider to See Better , 2015, ArXiv.

[37]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[38]  Peter M. Corcoran,et al.  A Depth Map Post-Processing Approach Based on Adaptive Random Walk With Restart , 2016, IEEE Access.

[39]  Matthias Nießner,et al.  ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Rares Ambrus,et al.  Semantically-Guided Representation Learning for Self-Supervised Monocular Depth , 2020, ICLR.

[41]  Zhengyou Zhang,et al.  A Flexible New Technique for Camera Calibration , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[42]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Vincent Lepetit,et al.  SharpNet: Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[45]  Christopher Joseph Pal,et al.  Learning Conditional Random Fields for Stereo , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Stefan Leutenegger,et al.  DeepFusion: Real-Time Dense 3D Reconstruction for Monocular SLAM using Single-View Depth and Gradient Predictions , 2019, 2019 International Conference on Robotics and Automation (ICRA).

[47]  Gernot Riegler,et al.  Connecting the Dots: Learning Representations for Active Monocular Depth Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Wojciech Matusik,et al.  Structure and motion from scene registration , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Richard Szeliski,et al.  High-accuracy stereo depth maps using structured light , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[50]  Ali Farhadi,et al.  Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks , 2016, ECCV.

[51]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[53]  Hossein Javidnia Contributions to the measurement of depth in consumer imaging , 2018 .

[54]  Tal Hassner,et al.  Learn Stereo, Infer Mono: Siamese Networks for Self-Supervised, Monocular, Depth Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[55]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[56]  Hongdong Li,et al.  Projective Multiview Structure and Motion from Element-Wise Factorization , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Russell H. Taylor,et al.  Self-supervised Learning for Dense Depth Estimation in Monocular Endoscopy , 2018, OR 2.0/CARE/CLIP/ISIC@MICCAI.

[58]  Chang-Su Kim,et al.  Monocular Depth Estimation Using Relative Depth Maps , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Valdir Grassi,et al.  Sparse-to-Continuous: Enhancing Monocular Depth Estimation using Occupancy Maps , 2018, 2019 19th International Conference on Advanced Robotics (ICAR).

[60]  Jörg Stückler,et al.  Semi-Supervised Deep Learning for Monocular Depth Map Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Anelia Angelova,et al.  Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos , 2018, AAAI.

[62]  Weifeng Chen,et al.  Single-Image Depth Perception in the Wild , 2016, NIPS.

[63]  Peter Corcoran,et al.  Semiparallel deep neural network hybrid architecture: first application on depth from monocular camera , 2018, J. Electronic Imaging.

[64]  Stefano Mattoccia,et al.  Enhancing Self-Supervised Monocular Depth Estimation with Traditional Visual Odometry , 2019, 2019 International Conference on 3D Vision (3DV).

[65]  Nicu Sebe,et al.  Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66]  Jia Deng,et al.  DeepV2D: Video to Depth with Differentiable Structure from Motion , 2018, ICLR.

[67]  Ken Chen,et al.  An approach for structured light system calibration , 2013, 2013 IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems.

[68]  Rongrong Ji,et al.  Semi-Supervised Adversarial Monocular Depth Estimation , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[69]  Gustavo Carneiro,et al.  Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue , 2016, ECCV.

[70]  Jie Li,et al.  Robust Semi-Supervised Monocular Depth Estimation with Reprojected Distances , 2019, CoRL.

[71]  Alexander H. Liu,et al.  Towards Scene Understanding: Unsupervised Monocular Depth Estimation With Semantic-Aware Representation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Stefano Mattoccia,et al.  Learning Monocular Depth Estimation Infusing Traditional Stereo Knowledge , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Markus Lienkamp,et al.  SemanticDepth: Fusing Semantic Segmentation and Monocular Depth Estimation for Enabling Autonomous Driving in Roads without Lane Lines , 2019, Sensors.

[74]  Peter Corcoran,et al.  Accurate Depth Map Estimation from Small Motions , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[75]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[76]  Jaakko Lehtinen,et al.  Noise2Noise: Learning Image Restoration without Clean Data , 2018, ICML.

[77]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.