PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Masked Autoencoders (MAE) learn strong visual representations and achieve state-of-the-art results in several individual modalities, yet very few works have examined their capabilities in multi-modal settings. In this work, we focus on point cloud and RGB image data, two modalities that often appear together in the real world, and explore their meaningful interactions. To improve upon the cross-modal synergy in existing works, we propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects. First, we note the importance of masking strategies between the two sources and use a projection module to complementarily align the masked and visible tokens of the two modalities. Second, we use a carefully designed two-branch MAE pipeline with a novel shared decoder to promote cross-modal interaction in the mask tokens. Finally, we design a cross-modal reconstruction module to enhance representation learning for both modalities. Through extensive experiments on large-scale RGB-D scene understanding benchmarks (SUN RGB-D and ScanNetV2), we find that interactively learning point-image features is nontrivial, and we improve multiple 3D detectors, 2D detectors, and few-shot classifiers by 2.9%, 6.7%, and 2.4%, respectively. Code is available at https://github.com/BLVLab/PiMAE.
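To make the idea of projection-based mask alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes points are given in the camera frame with positive depth, a pinhole intrinsic matrix `K`, and an image whose sides are divisible by the patch size. The function names `project_to_patches` and `complementary_image_mask` are hypothetical, and the rule used here (image patches hit by visible points are masked in the image branch) is one plausible reading of "complementary" alignment.

```python
import numpy as np


def project_to_patches(points, K, image_hw, patch):
    """Map 3D points (camera frame, z > 0) to flat image-patch indices."""
    h, w = image_hw
    uvw = points @ K.T                 # (N, 3) homogeneous pixel coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]      # perspective divide
    # Clamp to the image bounds so every point lands in a valid patch.
    u = np.clip(uv[:, 0], 0, w - 1).astype(int)
    v = np.clip(uv[:, 1], 0, h - 1).astype(int)
    return (v // patch) * (w // patch) + (u // patch)


def complementary_image_mask(points, visible, K, image_hw, patch):
    """Return a boolean mask over image patches (True = masked).

    Assumption: patches covered by *visible* points are masked in the
    image branch, so the two branches see complementary regions.
    """
    h, w = image_hw
    n_patches = (h // patch) * (w // patch)
    covered = project_to_patches(points[visible], K, image_hw, patch)
    mask = np.zeros(n_patches, dtype=bool)
    mask[covered] = True
    return mask
```

Calling `complementary_image_mask(points, visible, K, (64, 64), 16)` returns a length-16 boolean vector for a 64x64 image split into 16x16 patches. Flipping the rule (masking the patches hit by masked points instead) would give an aligned rather than complementary strategy.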
