Temp-Frustum Net: 3D Object Detection with Temporal Fusion

3D object detection is a core component of automated driving systems. State-of-the-art methods fuse RGB imagery and LiDAR point cloud data frame-by-frame for 3D bounding box regression. However, frame-by-frame 3D object detection suffers from noise, field-of-view obstruction, and sparsity. We propose a novel Temporal Fusion Module (TFM) to use information from previous time-steps to mitigate these problems. First, a state-of-the-art frustum network extracts point cloud features from raw RGB and LiDAR point cloud data frame-by-frame. Then, our TFM module fuses these features with a recurrent neural network. As a result, 3D object detection becomes robust against single frame failures and transient occlusions. Experiments on the KITTI object tracking dataset show the efficiency of the proposed TFM, where we obtain 6%, 4%, and 6% improvements on Car, Pedestrian, and Cyclist classes, respectively, compared to frame-by-frame baselines. Furthermore, ablation studies reinforce that the subject of improvement is temporal fusion and show the effects of different placements of TFM in the object detection pipeline. Our code is open-source and available at https://github.com/emecercelik/Temp-Frustum-Net.git.

[1]  Menglong Zhu,et al.  Mobile Video Object Detection with Temporally-Aware Feature Maps , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Leonidas J. Guibas,et al.  FlowNet3D: Learning Scene Flow in 3D Point Clouds , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Xinggang Wang,et al.  A Simple Baseline for Multi-Object Tracking , 2020, ArXiv.

[4]  Cewu Lu,et al.  Online Video Object Detection Using Association LSTM , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[5]  Xiaogang Wang,et al.  PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Vladlen Koltun,et al.  Tracking Objects as Points , 2020, ECCV.

[7]  Bolei Zhou,et al.  Temporal Pyramid Network for Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Ruigang Yang,et al.  LiDAR-Based Online 3D Video Object Detection With Graph-Based Message Passing and Spatiotemporal Transformer Attention , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Gerhard Rigoll,et al.  You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization , 2019, ArXiv.

[11]  Steven Lake Waslander,et al.  Joint 3D Proposal Generation and Object Detection from View Aggregation , 2017, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[12]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Shuicheng Yan,et al.  Seq-NMS for Video Object Detection , 2016, ArXiv.

[14]  Avideh Zakhor,et al.  3d Object Detection For Autonomous Driving Using Temporal Lidar Data , 2020, 2020 IEEE International Conference on Image Processing (ICIP).

[15]  Yujie Wang,et al.  Flow-Guided Feature Aggregation for Video Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Lei Zhang,et al.  Structure Aware Single-Stage 3D Object Detection From Point Cloud , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Xiaogang Wang,et al.  T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[18]  Thomas Brox,et al.  ECO: Efficient Convolutional Network for Online Video Understanding , 2018, ECCV.

[19]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Kris Kitani,et al.  GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking With 2D-3D Multi-Feature Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Zhihai He,et al.  Spatially supervised recurrent convolutional neural networks for visual object tracking , 2016, 2017 IEEE International Symposium on Circuits and Systems (ISCAS).

[23]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[24]  Zhichao Lu,et al.  RetinaTrack: Online Single Stage Joint Detection and Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Yin Zhou,et al.  VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Thomas Funkhouser,et al.  An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds , 2020, ECCV.

[27]  Yanan Sun,et al.  3DSSD: Point-Based 3D Single Stage Object Detector , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  V. Prisacariu,et al.  FlowNet3D++: Geometric Losses For Deep Scene Flow Estimation , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[29]  Xiaoyong Shen,et al.  STD: Sparse-to-Dense 3D Object Detector for Point Cloud , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Wei Wu,et al.  STM: SpatioTemporal and Motion Encoding for Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[32]  Truong Q. Nguyen,et al.  Context Matters: Refining Object Detection in Video with Recurrent Neural Networks , 2016, BMVC.

[33]  Philipp Krähenbühl,et al.  Center-based 3D Object Detection and Tracking , 2020, ArXiv.

[34]  Leonidas J. Guibas,et al.  Frustum PointNets for 3D Object Detection from RGB-D Data , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.