4D-Net for Learned Multi-Modal Alignment

We present 4D-Net, a 3D object detection approach that uses 3D point cloud and RGB sensing information, both over time. We incorporate this 4D information by performing novel dynamic connection learning across feature representations and levels of abstraction, and by observing geometric constraints. Our approach outperforms the state of the art and strong baselines on the Waymo Open Dataset. 4D-Net makes better use of motion cues and dense image information, which lets it detect distant objects more successfully. We will open source the code.
