Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles

We address the problem of 3D object detection from 2D monocular images in autonomous driving scenarios. We propose to lift the 2D images to 3D representations using learned neural networks and leverage existing networks working directly on 3D data to perform 3D object detection and localization. We show that, with carefully designed training mechanism and automatically selected minimally noisy data, such a method is not only feasible, but gives higher results than many methods working on actual 3D inputs acquired from physical sensors. On the challenging KITTI benchmark, we show that our 2D to 3D lifted method outperforms many recent competitive 3D networks while significantly outperforming previous state-of-the-art for 3D detection from monocular images. We also show that a late fusion of the output of the network trained on generated 3D images, with that trained on real 3D images, improves performance. We find the results very interesting and argue that such a method could serve as a highly reliable backup in case of malfunction of expensive 3D sensors, if not potentially making them redundant, at least in the case of low human injury risk autonomous navigation scenarios like warehouse automation.

[1]  Bo Yang,et al.  3D Object Reconstruction from a Single Depth View with Adversarial Learning , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[2]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[3]  Yi Fang,et al.  Learning Adversarial 3 D Model Generation with 2 D Image Enhancer , 2018 .

[4]  Kazunori Ohno,et al.  Vehicle detection and localization on bird's eye view elevation images using convolutional neural network , 2017, 2017 IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR).

[5]  Ji Wan,et al.  Multi-view 3D Object Detection Network for Autonomous Driving , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Sanja Fidler,et al.  Monocular 3D Object Detection for Autonomous Driving , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Fernando García,et al.  BirdNet: A 3D Object Detection Framework from LiDAR Information , 2018, 2018 21st International Conference on Intelligent Transportation Systems (ITSC).

[8]  Vassilios Morellas,et al.  Accurate 3D ground plane estimation from a single image , 2009, 2009 IEEE International Conference on Robotics and Automation.

[9]  Sanja Fidler,et al.  3D Object Proposals Using Stereo Imagery for Accurate Object Class Detection , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Sunglok Choi,et al.  Robust ground plane detection from 3D point clouds , 2014, International Conference on Control, Automation and Systems.

[11]  Loong Fah Cheong,et al.  Degeneracy in Self-Calibration Revisited and a Deep Learning Solution for Uncalibrated SLAM , 2019, 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[12]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Honglak Lee,et al.  Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision , 2016, NIPS.

[14]  Gregory D. Hager,et al.  Deep Supervision with Intermediate Concepts , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Thierry Chateau,et al.  Deep MANTA: A Coarse-to-Fine Many-Task Network for Joint 2D and 3D Vehicle Analysis from Monocular Image , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Vincent Frémont,et al.  Color-based road detection and its evaluation on the KITTI road benchmark , 2014, 2014 IEEE Intelligent Vehicles Symposium Proceedings.

[17]  K. Madhava Krishna,et al.  Shape priors for real-time monocular object localization in dynamic environments , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[18]  Subhransu Maji,et al.  Multi-view Convolutional Neural Networks for 3D Shape Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[19]  Alexei A. Efros,et al.  Toward Multimodal Image-to-Image Translation , 2017, NIPS.

[20]  Huimin Ma,et al.  3D Object Proposals for Accurate Object Class Detection , 2015, NIPS.

[21]  Steven Lake Waslander,et al.  Joint 3D Proposal Generation and Object Detection from View Aggregation , 2017, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[22]  Subhransu Maji,et al.  3D Shape Induction from 2D Views of Multiple Objects , 2016, 2017 International Conference on 3D Vision (3DV).

[23]  Jiajun Wu,et al.  Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling , 2016, NIPS.

[24]  Silvio Savarese,et al.  Data-driven 3D Voxel Patterns for object category recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Hao Su,et al.  A Point Set Generation Network for 3D Object Reconstruction from a Single Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Ersin Yumer,et al.  Transformation-Grounded Image Generation Network for Novel 3D View Synthesis , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  David Meger,et al.  Improved Adversarial Systems for 3D Object Generation and Reconstruction , 2017, CoRL.

[29]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Shiyu Song,et al.  Joint SFM and detection cues for monocular 3D localization in road scenes , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Jing Zhu,et al.  Learning Adversarial 3D Model Generation With 2D Image Enhancer , 2018, AAAI Conference on Artificial Intelligence.

[32]  Bin Xu,et al.  Multi-level Fusion Based 3D Object Detection from Monocular Images , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Leonidas J. Guibas,et al.  Frustum PointNets for 3D Object Detection from RGB-D Data , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Yuandong Tian,et al.  Single Image 3D Interpreter Network , 2016, ECCV.

[35]  Tian Xia,et al.  Vehicle Detection from 3D Lidar Using Fully Convolutional Network , 2016, Robotics: Science and Systems.

[36]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Jason J. Corso,et al.  A Continuous Occlusion Model for Road Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).