Joint Attention Mechanisms for Monocular Depth Estimation With Multi-Scale Convolutions and Adaptive Weight Adjustment

Monocular depth estimation is a fundamental problem for many vision applications and is attracting increasing attention in computer vision. Although the rapid progress of deep convolutional neural networks has brought great improvement, depth estimation at the finer details of objects remains unsatisfactory, especially in complex scenes that have rich structural information. In this article, we propose a deep end-to-end learning framework that combines multi-scale convolutions with joint attention mechanisms to tackle this challenge. Specifically, we first design a lightweight up-convolution to generate multi-scale feature maps. We then introduce an attention-based residual block that aggregates the different feature maps jointly along the channel and spatial dimensions, which enhances the discriminative ability of feature fusion at finer details. Furthermore, we explore an adaptive weight adjustment strategy for the loss function, which adjusts the weight of each loss term during training without additional hyper-parameters, to further improve performance. The proposed framework was evaluated on the challenging NYU Depth v2 and KITTI datasets. Experimental results demonstrate that the proposed approach is superior to most state-of-the-art methods.
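As a rough illustration of what an attention-based residual block acting jointly on the channel and spatial dimensions might look like, the following NumPy sketch applies a channel gate followed by a spatial gate and adds the result back to the input. It is not the architecture of this article, whose details the abstract does not give: the gating follows the general pattern of CBAM-style modules, the function names, weight shapes, and reduction ratio `r` are hypothetical, and the spatial gate uses a simple two-weight mix in place of a learned convolution.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Per-channel gates of shape (C, 1, 1) for a feature map of shape (C, H, W).

    w1 (C//r, C) and w2 (C, C//r) form a small shared MLP (hypothetical weights).
    """
    avg = feat.mean(axis=(1, 2))  # global average pooling -> (C,)
    mx = feat.max(axis=(1, 2))    # global max pooling -> (C,)

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # linear -> ReLU -> linear

    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]


def spatial_attention(feat, k):
    """Per-pixel gates of shape (1, H, W).

    k holds two mixing weights; this stands in for a learned convolution
    over the pooled maps (an assumption made to keep the sketch short).
    """
    avg = feat.mean(axis=0)  # average over channels -> (H, W)
    mx = feat.max(axis=0)    # max over channels -> (H, W)
    return sigmoid(k[0] * avg + k[1] * mx)[None, :, :]


def attention_residual_block(feat, w1, w2, k):
    """Channel gate, then spatial gate, then a residual (skip) connection."""
    refined = feat * channel_attention(feat, w1, w2)
    refined = refined * spatial_attention(refined, k)
    return feat + refined


# Toy input standing in for fused multi-scale features (random, illustrative only).
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 5, 2
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
k = np.array([0.5, 0.5])

out = attention_residual_block(feat, w1, w2, k)
print(out.shape)
```

Because both gates lie strictly between 0 and 1, the block can only rescale each feature before adding it back, so the skip connection preserves the original signal while the gates emphasize the channels and locations they score highly.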
