SUW-Learn: Joint Supervised, Unsupervised, Weakly Supervised Deep Learning for Monocular Depth Estimation

We introduce SUW-Learn, a framework for deep learning with joint supervised learning (S), unsupervised learning (U), and weakly supervised learning (W). We deploy SUW-Learn for deep learning of monocular depth from images and video sequences. The supervised learning module optimizes a depth estimation network using knowledge of the ground-truth depth. In contrast, the unsupervised learning module has no knowledge of the ground-truth depth, but optimizes the depth estimation network by predicting the current frame from the estimated 3D geometry. The weakly supervised module optimizes the depth estimation by evaluating the consistency between the estimated depth and weak labels derived from other information, such as semantic information. SUW-Learn trains the deep-learning networks end-to-end with joint optimization of the desired SUW objectives. We benchmark SUW-Learn on the commonly used KITTI driving-scene dataset and achieve state-of-the-art performance. To demonstrate the capacity of SUW-Learn to learn the depth of scenes with people from different sources with different domain knowledge, we construct the M&M dataset from the MegaDepth and Mannequin Challenge datasets.
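The joint optimization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the L1 and hinge loss forms, the weights `w_s`, `w_u`, `w_w`, and the representation of weak labels as "farther/closer" pixel pairs are all assumptions made for illustration. The reconstructed frame is taken as an input, since the warping step that produces it from the estimated 3D geometry is outside the scope of this sketch.

```python
import numpy as np


def supervised_loss(pred, gt, valid):
    """Mean absolute depth error over pixels that have ground truth (assumed L1 form)."""
    n = int(valid.sum())
    if n == 0:
        return 0.0
    return float(np.abs(pred - gt)[valid].sum() / n)


def unsupervised_loss(target_frame, reconstructed_frame):
    """Photometric error between the current frame and its reconstruction.

    The reconstruction is assumed to come from a separate warping step.
    """
    return float(np.abs(target_frame - reconstructed_frame).mean())


def weak_loss(pred, farther_idx, closer_idx, margin=0.0):
    """Hinge penalty when a pixel weakly labeled 'farther' is predicted closer.

    The index arrays are hypothetical weak labels (flat pixel indices), for
    example pairs derived from semantic classes.
    """
    farther_idx = np.asarray(farther_idx, dtype=int)
    closer_idx = np.asarray(closer_idx, dtype=int)
    if farther_idx.size == 0:
        return 0.0
    flat = pred.ravel()
    violation = flat[closer_idx] - flat[farther_idx] + margin
    return float(np.maximum(0.0, violation).mean())


def suw_loss(pred, gt, valid, target_frame, reconstructed_frame,
             farther_idx, closer_idx, w_s=1.0, w_u=1.0, w_w=1.0):
    """Weighted sum of the S, U, and W objectives (weights are assumptions)."""
    return (w_s * supervised_loss(pred, gt, valid)
            + w_u * unsupervised_loss(target_frame, reconstructed_frame)
            + w_w * weak_loss(pred, farther_idx, closer_idx))
```

In a real training setup the three terms would be computed on network outputs and differentiated end-to-end. The sketch only shows how the objectives might combine into a single scalar, and how a term can drop out when its labels are absent (no valid ground truth, or no weak pairs).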
