Rethinking Monocular Depth Estimation with Adversarial Training

Monocular depth estimation is an extensively studied computer vision problem with a vast variety of applications. Deep learning-based methods have demonstrated promise for both supervised and unsupervised depth estimation from monocular images. Most existing approaches treat depth estimation as a regression problem with a local pixel-wise loss function. In this work, we innovate beyond existing approaches by using adversarial training to learn a context-aware, non-local loss function. Such an approach penalizes the joint configuration of predicted depth values at the patch-level instead of the pixel-level, which allows networks to incorporate more global information. In this framework, the generator learns a mapping between RGB images and its corresponding depth map, while the discriminator learns to distinguish depth map and RGB pairs from ground truth. This conditional GAN depth estimation framework is stabilized using spectral normalization to prevent mode collapse when learning from diverse datasets. We test this approach using a diverse set of generators that include U-Net and joint CNN-CRF. We benchmark this approach on the NYUv2, Make3D and KITTI datasets, and observe that adversarial training reduces relative error by several fold, achieving state-of-the-art performance.

[1]  Joe W. Gray,et al.  SHIFT: speedy histopathological-to-immunofluorescent translation of whole slide images using conditional generative adversarial networks , 2018, Medical Imaging.

[2]  Dacheng Tao,et al.  Deep Ordinal Regression Network for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Kihwan Choi,et al.  Real-time image reconstruction for low-dose CT using deep convolutional generative adversarial networks (GANs) , 2018, Medical Imaging.

[4]  Ian D. Reid,et al.  Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Faisal Mahmood,et al.  Deep learning and conditional random fields‐based depth estimation and topographical reconstruction from conventional endoscopy , 2017, Medical Image Anal..

[6]  Alan L. Yuille,et al.  Towards unified depth and semantic prediction from a single image , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[8]  J. Aloimonos Shape from texture , 1988, Biological cybernetics.

[9]  Nicu Sebe,et al.  Monocular Depth Estimation Using Multi-Scale Continuous CRFs as Sequential Deep Networks , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Stephen Gould,et al.  Single image depth estimation from predicted semantic labels , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  Guosheng Lin,et al.  Deep convolutional neural fields for depth estimation from a single image , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Tomas Pfister,et al.  Learning from Simulated and Unsupervised Images through Adversarial Training , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[15]  Ming Yang,et al.  Conditional Generative Adversarial Network for Structured Domain Adaptation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Ali Borji,et al.  Cross-View Image Synthesis Using Conditional GANs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17]  Yuichi Yoshida,et al.  Spectral Normalization for Generative Adversarial Networks , 2018, ICLR.

[18]  Le Hui,et al.  Stacked Conditional Generative Adversarial Networks for Jointly Learning Shadow Detection and Shadow Removal , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Chen Sun,et al.  DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks , 2018, ArXiv.

[20]  Thomas O. Binford,et al.  Depth from Edge and Intensity Based Stereo , 1981, IJCAI.

[21]  Faisal Mahmood,et al.  Unsupervised Reverse Domain Adaptation for Synthetic Medical Images via Adversarial Training , 2017, IEEE Transactions on Medical Imaging.

[22]  Alexei A. Efros,et al.  Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[24]  Ayan Chakrabarti,et al.  Depth and Deblurring from a Spectrally-Varying Depth-of-Field , 2012, ECCV.

[25]  Martin A. Fischler,et al.  Computational Stereo , 1982, CSUR.

[26]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[27]  Alexei A. Efros,et al.  Recovering Surface Layout from an Image , 2007, International Journal of Computer Vision.

[28]  Jun Li,et al.  A Two-Streamed Network for Estimating Fine-Scaled Depth Maps from Single RGB Images , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Nicu Sebe,et al.  Unsupervised Adversarial Depth Estimation Using Cycled Generative Networks , 2018, 2018 International Conference on 3D Vision (3DV).

[30]  Qionghai Dai,et al.  DEPT: Depth Estimation by Parameter Transfer for Single Still Images , 2014, ACCV.

[31]  Takeo Kanade,et al.  A Multiple-Baseline Stereo , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[32]  Rynson W. H. Lau,et al.  Look Deeper into Depth: Monocular Depth Estimation with Semantic Booster and Attention-Driven Loss , 2018, ECCV.

[33]  Nicu Sebe,et al.  Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Simon Osindero,et al.  Conditional Generative Adversarial Nets , 2014, ArXiv.

[35]  Jörg Stückler,et al.  Semi-Supervised Deep Learning for Monocular Depth Map Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Camille Couprie,et al.  Semantic Segmentation using Adversarial Networks , 2016, NIPS 2016.

[37]  Andrea Vedaldi,et al.  MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[38]  Ashutosh Saxena,et al.  Make3D: Learning 3D Scene Structure from a Single Still Image , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Marc Pollefeys,et al.  Pulling Things out of Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Ian J. Goodfellow,et al.  NIPS 2016 Tutorial: Generative Adversarial Networks , 2016, ArXiv.

[41]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[42]  Daniel Cremers,et al.  FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture , 2016, ACCV.

[43]  Yann LeCun,et al.  DeSIGN: Design Inspiration from Generative Networks , 2018, ECCV Workshops.

[44]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[45]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[46]  Ali Farhadi,et al.  Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks , 2016, ECCV.

[47]  Georgios Tzimiropoulos,et al.  Super-FAN: Integrated Facial Landmark Localization and Super-Resolution of Real-World Low Resolution Faces in Arbitrary Poses with GANs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  Daphna Weinshall,et al.  Application Of Qualitative Depth And Shape From Stereo , 1988, [1988 Proceedings] Second International Conference on Computer Vision.

[49]  Jonathon Shlens,et al.  Conditional Image Synthesis with Auxiliary Classifier GANs , 2016, ICML.

[50]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[51]  Olivier D. Faugeras,et al.  Shape From Shading , 2006, Handbook of Mathematical Models in Computer Vision.

[52]  Sinisa Todorovic,et al.  Monocular Depth Estimation Using Neural Regression Forest , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Chunhua Shen,et al.  Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Han Zhang,et al.  Self-Attention Generative Adversarial Networks , 2018, ICML.

[55]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Christopher Joseph Pal,et al.  The Importance of Skip Connections in Biomedical Image Segmentation , 2016, LABELS/DLMIA@MICCAI.

[57]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[58]  David Picard,et al.  2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59]  Sing Bing Kang,et al.  Depth Transfer: Depth Extraction from Videos Using Nonparametric Sampling , 2016 .

[60]  Shree K. Nayar,et al.  Shape from focus: an effective approach for rough surfaces , 1990, Proceedings., IEEE International Conference on Robotics and Automation.