Paper information - VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection

In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) uses multi-view cameras on both vehicles and traffic infrastructure, providing a global vantage point with rich semantic context of road conditions beyond what a single vehicle viewpoint offers. Two major challenges prevail in VIC3D: 1) inherent calibration noise when fusing multi-view images, caused by time asynchrony across cameras; 2) information loss when projecting 2D features into 3D space. To address these issues, we propose a novel 3D object detection framework, Vehicle-Infrastructure Multi-view Intermediate fusion (VIMI). First, to fully exploit the holistic perspectives of both vehicles and infrastructure, we propose a Multi-scale Cross Attention (MCA) module that fuses infrastructure and vehicle features at selected scales to correct the calibration noise introduced by camera asynchrony. Then, we design a Camera-aware Channel Masking (CCM) module that uses camera parameters as priors to augment the fused features. We further introduce a Feature Compression (FC) module with channel and spatial compression blocks to reduce the size of the transmitted features and improve efficiency. Experiments show that VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C, significantly outperforming state-of-the-art early fusion and late fusion methods at comparable transmission cost.
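The abstract says only that CCM "uses camera parameters as priors to augment the fused features". One plausible reading, offered here as an assumption and not as the authors' implementation, is a squeeze-and-excitation-style gate: a small learned layer maps the camera parameters to one weight in (0, 1) per feature channel, and each channel is scaled by its weight. The minimal sketch below uses plain Python lists; the function names, the single linear layer, and all numeric values are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def camera_channel_mask(cam_params, weights, bias):
    """Map a flat camera-parameter vector to one gate per feature channel.

    weights has one row per channel (each row as long as cam_params) and
    bias has one value per channel. Both are hypothetical stand-ins for
    parameters that a real model would learn.
    """
    return [
        sigmoid(sum(w * p for w, p in zip(row, cam_params)) + b)
        for row, b in zip(weights, bias)
    ]


def apply_channel_mask(features, mask):
    """Scale every value of each channel (an H x W nested list) by its gate."""
    return [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(features, mask)
    ]


# Toy example: 3 camera parameters, 2 feature channels of size 1 x 2.
cam_params = [1.0, 0.5, -0.5]            # hypothetical normalized camera parameters
weights = [[0.0, 0.0, 0.0],              # channel 0 ignores the camera
           [2.0, 0.0, 0.0]]              # channel 1 reacts to the first parameter
bias = [0.0, 0.0]
features = [[[1.0, 2.0]], [[3.0, 4.0]]]

mask = camera_channel_mask(cam_params, weights, bias)
masked = apply_channel_mask(features, mask)
```

In this toy setup channel 0 receives a gate of exactly 0.5 because its weights are all zero, while channel 1 is gated by the first camera parameter. How the published CCM module actually encodes the intrinsics and extrinsics is not stated in the abstract.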
