Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction

While convolutional neural networks have shown a tremendous impact on various computer vision tasks, they generally demonstrate limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution operation. Initially designed for natural language processing tasks, Transformers have emerged as alternative architectures with innate global self-attention mechanisms to capture long-range dependencies. In this paper, we propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. To avoid the network losing its ability to capture locallevel details due to the adoption of transformers, we propose a novel decoder that employs attention mechanisms based on gates. Notably, this is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels (i.e., monocular depth prediction and surface normal estimation). Extensive experiments demonstrate that the proposed TransDepth achieves state-of-theart performance on three challenging datasets. Our code is available at: https://github.com/ygjwd12345/ TransDepth.

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  Yifan Wang,et al.  CLIFFNet for Monocular Depth Estimation with Hierarchical Embedding Loss , 2020, ECCV.

[3]  Nicu Sebe,et al.  Multi-Channel Attention Selection GAN With Cascaded Semantic Guidance for Cross-View Image Translation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Chunhua Shen,et al.  Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video , 2019, NeurIPS.

[5]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Alan L. Yuille,et al.  SURGE: Surface Regularized Geometry Estimation from a Single Image , 2016, NIPS.

[7]  Stergios I. Roumeliotis,et al.  Surface Normal Estimation of Tilted Images via Spatial Rectifier , 2020, ECCV.

[8]  Zejian Yuan,et al.  End-to-end Lane Shape Prediction with Transformers , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[9]  Chang Shu,et al.  Feature-metric Loss for Self-supervised Learning of Depth and Egomotion , 2020, ECCV.

[10]  Richard Bowden,et al.  DeFeat-Net: General Monocular Depth via Simultaneous Unsupervised Representation Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Jun Fu,et al.  Dual Attention Network for Scene Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Tao Xiang,et al.  Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Nicu Sebe,et al.  Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[15]  Jun Li,et al.  A Two-Streamed Network for Estimating Fine-Scaled Depth Maps from Single RGB Images , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Matthias Nießner,et al.  ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Xiangyu Zhang,et al.  Large Kernel Matters — Improve Semantic Segmentation by Global Convolutional Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[19]  Dacheng Tao,et al.  Deep Ordinal Regression Network for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Nicu Sebe,et al.  XingGAN for Person Image Generation , 2020, ECCV.

[22]  Chunhua Shen,et al.  Enforcing Geometric Constraints of Virtual Normal for Depth Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Yi Yang,et al.  Attention to Scale: Scale-Aware Semantic Image Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jan-Michael Frahm,et al.  VPLNet: Deep Single View Normal Estimation With Vanishing Points and Lines , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Guosheng Lin,et al.  Deep convolutional neural fields for depth estimation from a single image , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[27]  Jiri Matas,et al.  Guiding Monocular Depth Estimation Using Depth-Attention Volume , 2020, ECCV.

[28]  Hugo Latapie,et al.  Cross-View Exocentric to Egocentric Video Synthesis , 2021, ACM Multimedia.

[29]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[31]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[32]  Ling Shao,et al.  Cloth Interactive Transformer for Virtual Try-On , 2021, ArXiv.

[33]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[34]  Matthew Fisher,et al.  UprightNet: Geometry-Aware Camera Orientation Estimation From Single Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Stephen Lin,et al.  Local Relation Networks for Image Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36]  Nicu Sebe,et al.  PAD-Net: Multi-tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Martial Hebert,et al.  Data-Driven 3D Primitives for Single Image Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[38]  Bingbing Zhuang,et al.  Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction , 2020, ECCV.

[39]  Alan L. Yuille,et al.  Towards unified depth and semantic prediction from a single image , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Jae-Han Lee,et al.  Multi-loss Rebalancing Algorithm for Monocular Depth Estimation , 2020, ECCV.

[41]  Yiming Yang,et al.  Transformer-XL: Attentive Language Models beyond a Fixed-Length Context , 2019, ACL.

[42]  Leonidas J. Guibas,et al.  FrameNet: Learning Local Canonical Frames of 3D Surfaces From a Single RGB Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[43]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Nicu Sebe,et al.  Probabilistic Graph Attention Network With Conditional Kernels for Pixel-Wise Prediction , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Chunhua Shen,et al.  Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Yunchao Wei,et al.  CCNet: Criss-Cross Attention for Semantic Segmentation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[48]  A. Yuille,et al.  Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation , 2020, ECCV.

[49]  Tim Fingscheidt,et al.  Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance , 2020, ECCV.

[50]  Nicu Sebe,et al.  Dual Attention GANs for Semantic Image Synthesis , 2020, ACM Multimedia.

[51]  Il Hong Suh,et al.  From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation , 2019, ArXiv.

[52]  Lorenzo Bruzzone,et al.  Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images , 2021, ArXiv.

[53]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[54]  Nicu Sebe,et al.  Multi-scale Continuous CRFs as Sequential Deep Networks for Monocular Depth Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Munchurl Kim,et al.  Forget About the LiDAR: Self-Supervised Depth Estimators with MED Probability Volumes , 2020, NeurIPS.

[56]  Renjie Liao,et al.  GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57]  Bin Cheng,et al.  S3Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data , 2020, ECCV.

[58]  Gustavo Carneiro,et al.  Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Rares Ambrus,et al.  3D Packing for Self-Supervised Monocular Depth Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Yan Wang,et al.  TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation , 2021, ArXiv.

[61]  Abhinav Gupta,et al.  Marr Revisited: 2D-3D Alignment via Surface Normal Prediction , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Zhihao Xia,et al.  Generating and Exploiting Probabilistic Monocular Depth Estimates , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[64]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[65]  Michael J. Black,et al.  Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[66]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[67]  Weifeng Chen,et al.  Surface Normals in the Wild , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[68]  Rares Ambrus,et al.  Semantically-Guided Representation Learning for Self-Supervised Monocular Depth , 2020, ICLR.

[69]  Xiaogang Wang,et al.  Context Encoding for Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[70]  Yuxin Peng,et al.  The application of two-level attention models in deep convolutional neural network for fine-grained image classification , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).