To Complete or to Estimate, That is the Question: A Multi-Task Approach to Depth Completion and Monocular Depth Estimation

Robust three-dimensional scene understanding is a growing area of research with direct relevance to many real-world applications, such as autonomous driving and robotic navigation. In this paper, we propose a multi-task learning-based model capable of performing two tasks: sparse depth completion (i.e. generating complete dense scene depth given a sparse depth image as input) and monocular depth estimation (i.e. predicting scene depth from a single RGB image). The model comprises two sub-networks jointly trained end-to-end using data randomly sampled from a publicly available corpus of synthetic and real-world images. The first sub-network generates a sparse depth image by learning lower-level features from the scene, and the second predicts a full dense depth image of the entire scene, leading to a better geometric and contextual understanding of the scene and, as a result, superior performance. The entire model can be used to infer complete scene depth from a single RGB image, or the second sub-network can be used alone to perform depth completion given a sparse depth input. Using adversarial training, a robust objective function, a deep architecture relying on skip connections and a blend of synthetic and real-world training data, our approach is capable of producing high-quality scene depth. Extensive experimental evaluation demonstrates the efficacy of our approach compared to contemporary state-of-the-art techniques across both problem domains.
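The following is a minimal sketch, not the authors' implementation, of how the two-stage pipeline described above could be wired together, assuming PyTorch-style encoder-decoder sub-networks with a skip connection; all module names, channel widths and layer counts are illustrative assumptions.

```python
# Illustrative sketch of the two-sub-network multi-task pipeline:
# stage 1 predicts sparse depth from RGB, stage 2 completes it to dense depth.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Simple conv -> norm -> activation block used by both sub-networks.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SparseDepthNet(nn.Module):
    """First sub-network: predicts a sparse depth image from a single RGB image."""
    def __init__(self):
        super().__init__()
        self.encoder = conv_block(3, 32)
        self.decoder = nn.Conv2d(32, 1, kernel_size=3, padding=1)

    def forward(self, rgb):
        return self.decoder(self.encoder(rgb))

class CompletionNet(nn.Module):
    """Second sub-network: completes sparse depth (with RGB guidance) into
    dense scene depth, using a skip connection from encoder to decoder."""
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(4, 32)        # input: RGB + sparse depth
        self.enc2 = conv_block(32, 64)
        self.dec1 = conv_block(64 + 32, 32)  # skip connection from enc1
        self.out = nn.Conv2d(32, 1, kernel_size=3, padding=1)

    def forward(self, rgb, sparse_depth):
        x = torch.cat([rgb, sparse_depth], dim=1)
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d1 = self.dec1(torch.cat([e2, e1], dim=1))
        return self.out(d1)

# Monocular depth estimation runs both stages; depth completion can use
# CompletionNet alone when a sparse depth input (e.g. LiDAR) is available.
rgb = torch.randn(1, 3, 256, 512)
sparse = SparseDepthNet()(rgb)
dense = CompletionNet()(rgb, sparse)
```

In the sketch, the skip connection concatenates early encoder features with deeper ones before decoding, mirroring the paper's reliance on skip connections to preserve fine scene structure.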
