CompletionFormer: Depth Completion with Convolutions and Vision Transformers

Given sparse depth measurements and the corresponding RGB image, depth completion aims at spatially propagating the sparse measurements throughout the whole image to obtain a dense depth prediction. Despite the tremendous progress of deep-learning-based depth completion methods, the locality of convolutional layers and graph models makes it hard for the network to capture long-range relationships between pixels. While recent fully Transformer-based architectures have reported encouraging results thanks to their global receptive field, performance and efficiency gaps with respect to well-developed CNN models remain because Transformers degrade local feature details. This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples a convolutional attention layer and a Vision Transformer into one block, as the basic unit for constructing our depth completion model in a pyramidal structure. This hybrid architecture naturally benefits from both the local connectivity of convolutions and the global context of the Transformer in a single model. As a result, our CompletionFormer outperforms state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion benchmark and the indoor NYUv2 dataset, while being significantly more efficient (nearly 1/3 of the FLOPs) than pure Transformer-based methods. Code is available at \url{https://github.com/youmi-zym/CompletionFormer}.
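
The coupling described above can be made concrete with a small sketch. Below is a minimal PyTorch illustration of a JCAT-style block: a convolutional attention path (here a CBAM-like channel and spatial gate) provides local connectivity, a multi-head self-attention path over flattened tokens provides global context, and the two are fused residually. All module names, the fusion scheme, and the hyperparameters are illustrative assumptions for exposition, not the released implementation; refer to the repository above for the actual code.

```python
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    """Convolutional attention path (CBAM-like channel + spatial gating).
    Hypothetical stand-in for the paper's convolutional attention layer."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, kernel_size=1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(dim, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        x = x * self.channel_gate(x)   # reweight channels
        x = x * self.spatial_gate(x)   # reweight spatial locations
        return x

class JCATBlock(nn.Module):
    """Sketch of a Joint Convolutional Attention and Transformer block:
    the local (convolutional) and global (self-attention) paths share
    the same input and are fused through residual addition."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.local = ConvAttention(dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.local(x)                       # local connectivity
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        t = self.norm1(tokens)
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]  # global context
        tokens = tokens + self.mlp(self.norm2(tokens))
        global_feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return x + local + global_feat              # fuse both paths residually

if __name__ == "__main__":
    block = JCATBlock(dim=64)
    out = block(torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Stacking such blocks at multiple resolutions, as the abstract's pyramidal structure suggests, lets every stage mix local detail with global context inside one model rather than delegating them to separate networks.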
