DeepInteraction: 3D Object Detection via Modality Interaction

Existing top-performing 3D object detectors typically rely on a multi-modal fusion strategy. This design is, however, fundamentally restricted: it overlooks useful modality-specific information, which ultimately hampers model performance. To address this limitation, we introduce a novel modality interaction strategy in which individual per-modality representations are learned and maintained throughout, so that their unique characteristics can be exploited during object detection. To realize this strategy, we design the DeepInteraction architecture, characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments on the large-scale nuScenes dataset show that our method surpasses all prior methods, often by a large margin. Crucially, our method ranks first on the highly competitive nuScenes object detection leaderboard.
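
The difference between fusing modalities and letting them interact can be illustrated with a toy sketch. This is not the DeepInteraction implementation: the function names (`cross_attend`, `interaction_layer`), the two-dimensional features, and the use of plain dot-product attention with a residual update are all assumptions made for illustration. The only point carried over from the abstract is that each modality keeps its own representation, and each is refined using the other rather than being merged into one.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Residual update: each query adds an attention-weighted sum of context vectors.

    Hypothetical helper; plain dot-product attention stands in for whatever
    interaction operator a real model would use.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended = [sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)]
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def interaction_layer(lidar_feats, image_feats):
    """One interaction step: both modalities are updated and both are returned.

    Neither representation is collapsed into the other, which is the contrast
    with a fusion step that would return a single merged feature set.
    """
    new_lidar = cross_attend(lidar_feats, image_feats)
    new_image = cross_attend(image_feats, lidar_feats)
    return new_lidar, new_image


# Toy features (assumed values): 3 LiDAR tokens and 2 image tokens, 2 dims each.
lidar = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[0.5, 0.5], [1.0, -1.0]]

# Stack two interaction layers; the two streams stay separate throughout.
for _ in range(2):
    lidar, image = interaction_layer(lidar, image)

print(len(lidar), len(image))  # 3 2
```

After any number of layers the LiDAR and image streams still have their original token counts, so a downstream decoder could, in principle, draw on either one separately.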
