Generative Adversarial Networks for Unsupervised Monocular Depth Prediction

Estimating depth from a single image is a challenging and exciting topic in computer vision, with implications for several application domains. Recently proposed deep learning approaches achieve outstanding results by tackling it as an image reconstruction task and exploiting geometric constraints (e.g., epipolar geometry) to obtain supervisory signals for training. Inspired by these works and by the compelling results achieved by Generative Adversarial Networks (GANs) on image reconstruction and generation tasks, in this paper we propose to cast unsupervised monocular depth estimation within a GAN paradigm. The generator network learns to infer depth from the reference image in order to generate a warped target image. At training time, the discriminator network learns to distinguish between fake images produced by the generator and target frames acquired with a stereo rig. To the best of our knowledge, our proposal is the first successful attempt to tackle monocular depth estimation with a GAN paradigm, and an extensive evaluation on the Cityscapes and KITTI datasets confirms that it improves on traditional approaches. Additionally, we highlight a major issue with the data used by a standard evaluation protocol widely adopted in this field, and we address it using a more reliable dataset recently made available by the KITTI evaluation benchmark.
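
The warping and adversarial roles described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the sampling convention (output pixel x reads the source at x minus the disparity), the border clamping, and the plain binary cross-entropy losses are all assumptions made for illustration, and a single-channel image stands in for a real stereo frame.

```python
import numpy as np


def warp_with_disparity(source, disparity):
    """Horizontally resample a (H, W) image with linear interpolation.

    Assumed convention: out[y, x] = source[y, x - disparity[y, x]],
    with sampling positions clamped to the image border.
    """
    h, w = source.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    xs = np.clip(xs, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(int)          # left neighbour
    x1 = np.minimum(x0 + 1, w - 1)         # right neighbour, clamped
    frac = xs - x0                         # interpolation weight
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * source[rows, x0] + frac * source[rows, x1]


def discriminator_loss(d_real, d_fake, eps=1e-7):
    """Binary cross-entropy: real target frames labeled 1, warped frames 0."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))


def generator_adversarial_loss(d_fake, eps=1e-7):
    """The generator is rewarded when the discriminator scores its warps as real."""
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return -np.mean(np.log(d_fake))


if __name__ == "__main__":
    image = np.tile(np.arange(5.0), (2, 1))   # toy 2x5 intensity ramp
    shifted = warp_with_disparity(image, np.ones_like(image))
    print(shifted[0])
```

In a real system the disparity map would come from the generator network and the scores `d_real` and `d_fake` from the discriminator; here they are plain arrays so that the sketch stays self-contained.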
