Feature Map Transformation for Multi-sensor Fusion in Object Detection Networks for Autonomous Driving

We present a general framework for intermediate-stage fusion of pre-trained object detection networks that operate on different sensor modalities in autonomous cars. The key innovation is an autoencoder-inspired Transformer module that transforms both the perspective and the feature activation characteristics of one sensor modality into those of another. The transformed feature maps can then be combined with those of a modality-native feature extractor through a simple fusion scheme to enhance performance and reliability. Our approach is not limited to specific object detection network types. In contrast to other methods, our framework allows the fusion of pre-trained object detection networks and fuses sensor modalities at a single stage, resulting in a modular and traceable architecture. We demonstrate the effectiveness of the proposed scheme by fusing camera and Lidar information to detect objects on our own dataset as well as on the KITTI dataset.
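
The data flow described above can be illustrated with a minimal NumPy sketch. It is not the architecture from the paper: the abstract does not specify layer types, sizes, the transformation direction, or the fusion operator, so the dense encoder/decoder with a bottleneck, the Lidar-to-camera direction, the element-wise mean fusion, and all names and shapes below (`transform_feature_map`, `fuse`, `lidar_shape`, `camera_shape`, `bottleneck`) are assumptions chosen for illustration. The weights are random and untrained, so only the shapes and the flow of data are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense_relu(x, w, b):
    """Fully connected layer followed by a ReLU."""
    return np.maximum(x @ w + b, 0.0)


def transform_feature_map(src_map, params, out_shape):
    """Map a source-modality feature map to the target modality's layout.

    The map is flattened, squeezed through a bottleneck (encoder) and
    expanded again (decoder). Because the layers are dense, the spatial
    layout of the output may differ from that of the input, which is how
    this sketch stands in for a change of perspective.
    """
    flat = src_map.reshape(-1)
    code = dense_relu(flat, params["w_enc"], params["b_enc"])
    out = dense_relu(code, params["w_dec"], params["b_dec"])
    return out.reshape(out_shape)


def fuse(native_map, transformed_map):
    """Assumed fusion scheme: element-wise mean of the two feature maps."""
    return 0.5 * (native_map + transformed_map)


# Hypothetical feature-map shapes (height, width, channels).
lidar_shape = (8, 8, 4)
camera_shape = (6, 10, 4)
bottleneck = 32

n_in = int(np.prod(lidar_shape))
n_out = int(np.prod(camera_shape))

# Random, untrained parameters; a real module would be learned from data.
params = {
    "w_enc": rng.standard_normal((n_in, bottleneck)) * 0.1,
    "b_enc": np.zeros(bottleneck),
    "w_dec": rng.standard_normal((bottleneck, n_out)) * 0.1,
    "b_dec": np.zeros(n_out),
}

# Stand-ins for the outputs of two pre-trained feature extractors.
lidar_features = rng.standard_normal(lidar_shape)
camera_features = rng.standard_normal(camera_shape)

transformed = transform_feature_map(lidar_features, params, camera_shape)
fused = fuse(camera_features, transformed)
```

In this sketch `fused` has the same shape as the camera-native feature map, so it could be passed to the remaining layers of a camera detection network without changing them, which is one way to read the modularity claim in the abstract.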
