TiG-BEV: Multi-view BEV 3D Object Detection via Target Inner-Geometry Learning

To achieve accurate and low-cost 3D object detection, existing methods enhance camera-based multi-view detectors with spatial cues from the LiDAR modality, e.g., dense depth supervision and bird's-eye-view (BEV) feature distillation. However, they directly perform point-to-point mimicking from LiDAR to camera, which neglects the inner geometry of foreground targets and suffers from the modality gap between 2D and 3D features. In this paper, we propose a learning scheme that transfers Target Inner-Geometry from the LiDAR modality into camera-based BEV detectors for both dense depth and BEV features, termed TiG-BEV. First, we introduce an inner-depth supervision module that learns the low-level relative depth relations between different foreground pixels. This enables the camera-based detector to better understand object-wise spatial structures. Second, we design an inner-feature BEV distillation module that imitates the high-level semantics of different keypoints within foreground targets. To further alleviate the BEV feature gap between the two modalities, we adopt both inter-channel and inter-keypoint distillation for feature-similarity modeling. With our target inner-geometry distillation, TiG-BEV boosts BEVDepth by +2.3% NDS and +2.4% mAP, and BEVDet by +9.1% NDS and +10.3% mAP, on the nuScenes val set. Code will be available at https://github.com/ADLab3Ds/TiG-BEV.
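The two ideas in the abstract can be illustrated with a small, dependency-free sketch. It is not the released TiG-BEV code: the function names, the choice of reference pixel, the unnormalized dot-product similarity, and the mean-squared-error form of both losses are assumptions made for illustration only. The first part compares relative depths inside one foreground target instead of absolute depths; the second compares keypoint-by-keypoint and channel-by-channel similarity matrices of student (camera) and teacher (LiDAR) features instead of the raw features.

```python
def relative_depths(depths, ref_index):
    """Depth of each foreground pixel minus the depth of a reference pixel.

    The reference pixel is passed in by the caller (an assumption of this sketch).
    """
    ref = depths[ref_index]
    return [d - ref for d in depths]


def inner_depth_loss(pred, target, ref_index):
    """Mean squared error between predicted and target relative depths."""
    rel_pred = relative_depths(pred, ref_index)
    rel_target = relative_depths(target, ref_index)
    return sum((p - t) ** 2 for p, t in zip(rel_pred, rel_target)) / len(rel_pred)


def gram(rows):
    """Pairwise dot products between row vectors, giving an N x N matrix."""
    return [[sum(x * y for x, y in zip(a, b)) for b in rows] for a in rows]


def transpose(matrix):
    """Swap rows and columns: K keypoints x C channels -> C x K."""
    return [list(col) for col in zip(*matrix)]


def similarity_loss(student, teacher):
    """Mean squared difference between the two similarity (Gram) matrices."""
    g_s, g_t = gram(student), gram(teacher)
    n = len(g_s)
    total = sum((g_s[i][j] - g_t[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)


def inner_feature_loss(student_kp, teacher_kp):
    """Sum of inter-keypoint (K x K) and inter-channel (C x C) similarity losses.

    Both inputs are K x C lists: one feature vector per keypoint of a target.
    """
    inter_keypoint = similarity_loss(student_kp, teacher_kp)
    inter_channel = similarity_loss(transpose(student_kp), transpose(teacher_kp))
    return inter_keypoint + inter_channel


if __name__ == "__main__":
    # A constant depth offset leaves the relative structure unchanged.
    print(inner_depth_loss([10.0, 12.0, 15.0], [20.0, 22.0, 25.0], 0))
    # Identical keypoint features give zero similarity loss.
    feats = [[1.0, 0.0], [0.0, 1.0]]
    print(inner_feature_loss(feats, feats))
```

In this sketch, a prediction that is wrong by a constant depth offset incurs no inner-depth loss, because only the structure within the target is compared. Matching similarity matrices rather than raw features likewise constrains only the relations among keypoints and channels, which is one way to read the abstract's point about easing the gap between modalities.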
