Multi-Modal Streaming 3D Object Detection

Modern autonomous vehicles rely heavily on mechanical LiDARs for perception. Current perception methods generally require 360° point clouds, collected sequentially as the LiDAR scans the azimuth and acquires consecutive wedge-shaped slices. The acquisition latency of a full scan (∼100 ms) may lead to outdated perception, which is detrimental to safe operation. Recent streaming perception works proposed directly processing LiDAR slices and compensating for the narrow field of view (FoV) of a slice by reusing features from preceding slices. These works, however, are all based on a single modality and require past information, which may be outdated. Meanwhile, images from high-frequency cameras can support streaming models, as they provide a larger FoV than a LiDAR slice. However, this difference in FoV complicates sensor fusion. To address this research gap, we propose an innovative camera-LiDAR streaming 3D object detection framework that uses camera images instead of past LiDAR slices to provide an up-to-date, dense, and wide context for streaming perception. The proposed method outperforms prior streaming models on the challenging nuScenes benchmark. It also outperforms powerful full-scan detectors while being much faster. Our method is shown to be robust to missing camera images, narrow LiDAR slices, and small camera-LiDAR miscalibration.
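
To make the slice-based acquisition concrete, the following minimal sketch partitions a point cloud into wedge-shaped azimuth slices and computes the acquisition time of one slice from the full-scan time. It is an illustration only, not the proposed framework: the function names and the slice counts (4 and 32) are assumptions chosen for the example, and the only figure taken from the text is the roughly 100 ms full-scan latency.

```python
import math


def slice_index(x, y, num_slices):
    """Return the index of the azimuth wedge that contains the point (x, y)."""
    # Map the azimuth from (-pi, pi] to [0, 2*pi).
    azimuth = math.atan2(y, x) % (2.0 * math.pi)
    # Clamp guards against floating-point rounding at the 2*pi boundary.
    return min(int(azimuth / (2.0 * math.pi) * num_slices), num_slices - 1)


def split_into_slices(points, num_slices):
    """Group (x, y, z) points into num_slices equal-width azimuth wedges."""
    slices = [[] for _ in range(num_slices)]
    for point in points:
        slices[slice_index(point[0], point[1], num_slices)].append(point)
    return slices


def slice_latency_ms(full_scan_ms, num_slices):
    """Acquisition time of one slice, assuming a constant rotation rate."""
    return full_scan_ms / num_slices


# Hypothetical example: one point in each quadrant, four slices.
points = [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.0), (-1.0, -1.0, 0.0), (1.0, -1.0, 0.0)]
slices = split_into_slices(points, num_slices=4)
print([len(s) for s in slices])
print(slice_latency_ms(100.0, 32))
```

Under the constant-rotation assumption, a slice covering 1/32 of the azimuth is available after about 3 ms instead of 100 ms, which is the latency gap that streaming detectors aim to exploit. The price is the narrow FoV of each slice, which is what the camera images are meant to compensate for.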
