Guiding Monocular Depth Estimation Using Depth-Attention Volume

Recovering the scene depth from a single image is an ill-posed problem that requires additional priors, often referred to as monocular depth cues, to disambiguate different 3D interpretations. In recent works, those priors have been learned in an end-to-end manner from large datasets by using deep neural networks. In this paper, we propose guiding depth estimation to favor planar structures that are ubiquitous especially in indoor environments. This is achieved by incorporating a non-local coplanarity constraint to the network with a novel attention mechanism called depth-attention volume (DAV). Experiments on two popular indoor datasets, namely NYU-Depth-v2 and ScanNet, show that our method achieves state-of-the-art depth estimation results while using only a fraction of the number of parameters needed by the competing methods.

[1]  Nicu Sebe,et al.  Pattern-Affinitive Propagation Across Depth, Surface Normal and Semantic Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Andrew Owens,et al.  SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels , 2013, 2013 IEEE International Conference on Computer Vision.

[4]  Bo Chen,et al.  MnasNet: Platform-Aware Neural Architecture Search for Mobile , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Takayuki Okatani,et al.  Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps With Accurate Object Boundaries , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[6]  Thomas Brox,et al.  CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[8]  Jonathan T. Barron,et al.  A category-level 3-D object dataset: Putting the Kinect to work , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[9]  Zhiguo Cao,et al.  Deep attention-based classification network for robust depth prediction , 2018, ACCV.

[10]  Renjie Liao,et al.  GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Chang-Su Kim,et al.  Monocular Depth Estimation Using Relative Depth Maps , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[13]  Richard Szeliski Structure from motion , 2011 .

[14]  Jianxiong Xiao,et al.  SUN RGB-D: A RGB-D scene understanding benchmark suite , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Quoc V. Le,et al.  Attention Augmented Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[17]  Xuejin Chen,et al.  Structure-Aware Residual Pyramid Network for Monocular Depth Estimation , 2019, IJCAI.

[18]  Ashutosh Saxena,et al.  Learning Depth from Single Monocular Images , 2005, NIPS.

[19]  Shu Kong,et al.  Pixel-Wise Attentional Gating for Scene Parsing , 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[20]  Jonathon Shlens,et al.  A Learned Representation For Artistic Style , 2016, ICLR.

[21]  Rynson W. H. Lau,et al.  Look Deeper into Depth: Monocular Depth Estimation with Semantic Booster and Attention-Driven Loss , 2018, ECCV.

[22]  Nicu Sebe,et al.  Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Matthias Nießner,et al.  ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[25]  Friedrich Fraundorfer,et al.  Evaluation of CNN-based Single-Image Depth Estimation Methods , 2018, ECCV Workshops.

[26]  Jan Kautz,et al.  PlaneRCNN: 3D Plane Detection and Reconstruction From a Single Image , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Thomas A. Funkhouser,et al.  Dilated Residual Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[29]  Shaodi You,et al.  Detail Preserving Depth Estimation from a Single Image Using Attention Guided Networks , 2018, 2018 International Conference on 3D Vision (3DV).

[30]  Bo Li,et al.  Monocular Depth Estimation with Hierarchical Fusion of Dilated CNNs and Soft-Weighted-Sum Inference , 2017, Pattern Recognit..

[31]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[32]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Ersin Yumer,et al.  PlaneNet: Piece-Wise Planar Reconstruction from a Single RGB Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[35]  Sertac Karaman,et al.  FastDepth: Fast Monocular Depth Estimation on Embedded Systems , 2019, 2019 International Conference on Robotics and Automation (ICRA).

[36]  Vincent Lepetit,et al.  SharpNet: Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[37]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[38]  Jungwon Lee,et al.  Deep Robust Single Image Depth Estimation Neural Network Using Scene Understanding , 2019, CVPR Workshops.

[39]  Xiaogang Wang,et al.  Residual Attention Network for Image Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[41]  Chunhua Shen,et al.  Enforcing Geometric Constraints of Virtual Normal for Depth Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[42]  Ersin Yumer,et al.  Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Dacheng Tao,et al.  Deep Ordinal Regression Network for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.