GD-MAE: Generative Decoder for MAE Pre-Training on LiDAR Point Clouds

Despite the tremendous progress of Masked Autoencoders (MAE) in vision tasks on images and videos, exploring MAE on large-scale 3D point clouds remains challenging due to their inherent irregularity. In contrast to previous 3D MAE frameworks, which either design a complex decoder to infer masked information from the retained regions or adopt sophisticated masking strategies, we propose a much simpler paradigm. The core idea is to apply a Generative Decoder for MAE (GD-MAE) that automatically merges the surrounding context to restore the masked geometric knowledge in a hierarchical fusion manner. In doing so, our approach avoids heuristic decoder designs and enjoys the flexibility of exploring various masking strategies. The corresponding decoder part costs less than 12% of the latency of conventional methods, while achieving better performance. We demonstrate the efficacy of the proposed method on several large-scale benchmarks: Waymo, KITTI, and ONCE. Consistent improvements on downstream detection tasks illustrate its strong robustness and generalization capability. Not only does our method achieve state-of-the-art results, but, remarkably, it also attains comparable accuracy with only 20% of the labeled data on the Waymo dataset. Code will be released at https://github.com/Nightmare-n/GD-MAE.
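
As a rough illustration of the masking step that any MAE-style pre-training relies on, the following minimal sketch splits a set of tokens into a visible subset (fed to the encoder) and a masked subset (the reconstruction target). It is not taken from the GD-MAE code: the function name `random_mask`, the token layout, and the 0.75 mask ratio are assumptions chosen for illustration, and uniform random masking is only one of the masking strategies the approach could be paired with.

```python
import numpy as np


def random_mask(num_tokens, mask_ratio, seed=0):
    """Split token indices into visible and masked sets (generic MAE-style masking).

    Hypothetical helper for illustration; not part of the GD-MAE release.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_tokens)          # random order of token indices
    num_masked = int(num_tokens * mask_ratio)   # number of tokens to hide
    masked = np.sort(perm[:num_masked])
    visible = np.sort(perm[num_masked:])
    return visible, masked


# Assumed toy input: 8 occupied voxels/pillars, each with a 3-dim feature.
tokens = np.arange(24, dtype=np.float32).reshape(8, 3)
visible_idx, masked_idx = random_mask(len(tokens), mask_ratio=0.75)

encoder_input = tokens[visible_idx]          # only visible tokens are encoded
reconstruction_target = tokens[masked_idx]   # the decoder must restore these
```

With 8 tokens and a ratio of 0.75, six tokens are masked and two remain visible; the decoder's job is then to recover the geometry of the masked tokens from the encoded context.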
