Geometric-aware Pretraining for Vision-centric 3D Object Detection

Multi-camera 3D object detection for autonomous driving is a challenging problem that has attracted considerable attention from both academia and industry. A key obstacle for vision-based methods is extracting geometry-aware features from RGB images. Recent approaches acquire spatial information by using geometric-aware image backbones pretrained on depth-relevant tasks. However, these approaches overlook view transformation: the spatial knowledge learned by the image backbone is misaligned with the view transformation module, and performance suffers as a result. To address this issue, we propose a novel geometric-aware pretraining framework called GAPretrain. Our approach injects spatial and structural cues into camera networks by using the geometry-rich LiDAR modality as guidance during the pretraining phase. Transferring modality-specific attributes across modalities is non-trivial, but we bridge this gap by using a unified bird's-eye-view (BEV) representation and structural hints derived from LiDAR point clouds to facilitate the pretraining process. GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors. Our experiments demonstrate the effectiveness and generalization ability of the proposed method. With BEVFormer, we achieve 46.2 mAP and 55.5 NDS on the nuScenes val set, gains of 2.7 and 2.1 points, respectively. We also conduct experiments on various image backbones and view transformations to validate the efficacy of our approach. Code will be released at https://github.com/OpenDriveLab/BEVPerception-Survey-Recipe.
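
The abstract does not specify the pretraining loss or how the structural hints are formed, so the following is only a minimal sketch of one common form of LiDAR-guided supervision in a shared BEV grid: a camera BEV feature map is pulled toward a LiDAR BEV feature map, but only in cells that LiDAR points actually occupy. The function names (`occupancy_mask`, `masked_imitation_loss`), the mean-squared-error objective, and the occupancy-based mask are all assumptions made for illustration and are not taken from the GAPretrain code.

```python
from typing import List, Sequence, Tuple

# A BEV feature map indexed as [row][col][channel] (hypothetical layout).
BevFeatures = List[List[List[float]]]


def occupancy_mask(points: Sequence[Tuple[float, float]],
                   x_range: Tuple[float, float],
                   y_range: Tuple[float, float],
                   cell: float) -> List[List[bool]]:
    """Mark each BEV cell that contains at least one LiDAR point (x, y)."""
    cols = int(round((x_range[1] - x_range[0]) / cell))
    rows = int(round((y_range[1] - y_range[0]) / cell))
    mask = [[False] * cols for _ in range(rows)]
    for x, y in points:
        col = int((x - x_range[0]) // cell)
        row = int((y - y_range[0]) // cell)
        if 0 <= row < rows and 0 <= col < cols:  # drop points outside the grid
            mask[row][col] = True
    return mask


def masked_imitation_loss(camera_bev: BevFeatures,
                          lidar_bev: BevFeatures,
                          mask: List[List[bool]]) -> float:
    """Per-cell mean squared feature difference, averaged over occupied cells."""
    total, count = 0.0, 0
    for r, mask_row in enumerate(mask):
        for c, occupied in enumerate(mask_row):
            if not occupied:
                continue  # empty cells carry no structural hint in this sketch
            cam, lid = camera_bev[r][c], lidar_bev[r][c]
            total += sum((a - b) ** 2 for a, b in zip(cam, lid)) / len(cam)
            count += 1
    return total / count if count else 0.0


# Toy 2x2 grid with 1 m cells; the third point lies outside the grid.
points = [(0.5, 0.5), (1.5, 0.2), (9.0, 9.0)]
mask = occupancy_mask(points, x_range=(0.0, 2.0), y_range=(0.0, 2.0), cell=1.0)
camera = [[[1.0, 2.0], [0.0, 0.0]], [[5.0, 5.0], [5.0, 5.0]]]
lidar = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 0.0], [0.0, 0.0]]]
loss = masked_imitation_loss(camera, lidar, mask)
```

Restricting the loss to occupied cells is one way to keep the camera branch from imitating LiDAR features in regions with no geometric evidence. A real pipeline would operate on framework tensors and may weight foreground and background differently.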
