LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection

LiDAR and camera are two common sensors to collect data in time for 3D object detection under the autonomous driving context. Though the complementary information across sensors and time has great potential of benefiting 3D perception, taking full advantage of sequential cross-sensor data still remains challenging. In this paper, we propose a novel LiDAR Image Fusion Transformer (LIFT) to model the mutual interaction relationship of cross-sensor data over time. LIFT learns to align the input 4D sequential cross-sensor data to achieve multi-frame multi-modal information aggregation. To alleviate computational load, we project both point clouds and images into the bird-eye-view maps to compute sparse grid-wise self-attention. LIFT also benefits from a cross-sensor and cross-time data augmentation scheme. We evaluate the proposed approach on the challenging nuScenes and Waymo datasets, where our LIFT performs well over the state-of-the-art and strong baselines.

[1]  Wengang Zhou,et al.  Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection for Autonomous Driving , 2020, IEEE Transactions on Circuits and Systems for Video Technology.

[2]  Xiaokang Yang,et al.  Cross-Modal 3D Object Detection and Tracking for Auto-Driving , 2021, 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[3]  Michael S. Ryoo,et al.  4D-Net for Learned Multi-Modal Alignment , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[4]  Xiaokang Yang,et al.  PointAugmenting: Cross-Modal Augmentation for 3D Object Detection , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Philipp Krähenbühl,et al.  Towards Long-Form Video Understanding , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Andreas Geiger,et al.  Multi-Modal Fusion Transformer for End-to-End Autonomous Driving , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Jiquan Ngiam,et al.  3D-MAN: 3D Multi-frame Attention Network for Object Detection , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Dragomir Anguelov,et al.  Offboard 3D Object Detection from Point Cloud Sequences , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Heng Wang,et al.  Is Space-Time Attention All You Need for Video Understanding? , 2021, ICML.

[10]  Raquel Urtasun,et al.  Auto4D: Learning to Label 4D Objects from Sequential Point Clouds , 2021, ArXiv.

[11]  R. Urtasun,et al.  Perceive, Attend, and Drive: Learning Spatial Attention for Safe Self-Driving , 2020, 2021 IEEE International Conference on Robotics and Automation (ICRA).

[12]  Efthymios Tzinis,et al.  Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds , 2020, ICLR.

[13]  Marco Cristani,et al.  Transformer Networks for Trajectory Forecasting , 2020, 2020 25th International Conference on Pattern Recognition (ICPR).

[14]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Raquel Urtasun,et al.  End-to-end Contextual Perception and Prediction with Interaction Transformer , 2020, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[16]  Thomas Funkhouser,et al.  An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds , 2020, ECCV.

[17]  Chen Sun,et al.  Multi-modal Transformer for Video Retrieval , 2020, ECCV.

[18]  Xiang Bai,et al.  EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection , 2020, ECCV.

[19]  Jun Won Choi,et al.  3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection , 2020, ECCV.

[20]  Ruigang Yang,et al.  LiDAR-Based Online 3D Video Object Detection With Graph-Based Message Passing and Spatiotemporal Transformer Attention , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Yanan Sun,et al.  3DSSD: Point-Based 3D Single Stage Object Detector , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Dragomir Anguelov,et al.  Scalability in Perception for Autonomous Driving: Waymo Open Dataset , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Alex H. Lang,et al.  PointPainting: Sequential Fusion for 3D Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jean Pierre Mercat,et al.  Multi-Head Attention for Multi-Modal Joint Vehicle Motion Forecasting , 2019, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[25]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Zhaoxin Li,et al.  STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Xiaoyong Shen,et al.  STD: Sparse-to-Dense 3D Object Detector for Point Cloud , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Bin Yang,et al.  Multi-Task Multi-Sensor Fusion for 3D Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jiong Yang,et al.  PointPillars: Fast Encoders for Object Detection From Point Clouds , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Xiaogang Wang,et al.  PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[32]  Bin Yang,et al.  HDNET: Exploiting HD Maps for 3D Object Detection , 2018, CoRL.

[33]  Bo Li,et al.  SECOND: Sparsely Embedded Convolutional Detection , 2018, Sensors.

[34]  Bin Yang,et al.  Deep Continuous Fusion for Multi-sensor 3D Object Detection , 2018, ECCV.

[35]  Bin Yang,et al.  Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Bin Yang,et al.  PIXOR: Real-time 3D Object Detection from Point Clouds , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Steven Lake Waslander,et al.  Joint 3D Proposal Generation and Object Detection from View Aggregation , 2017, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[38]  Leonidas J. Guibas,et al.  Frustum PointNets for 3D Object Detection from RGB-D Data , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Yin Zhou,et al.  VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Trevor Darrell,et al.  Deep Layer Aggregation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[42]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Ji Wan,et al.  Multi-view 3D Object Detection Network for Autonomous Driving , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  James R. Glass,et al.  Unsupervised Learning of Spoken Language with Visual Context , 2016, NIPS.