Direct 3D Detection of Vehicles in Monocular Images with a CNN based 3D Decoder

In autonomous driving, the detection of objects like surrounding vehicles based on monocular RGB images is usually performed by 2D bounding box detectors. The resulting 2D objects can be used for a first coarse 3D position estimate but for a precise location, additional sensor data has to be taken into account. For further use in sensor fusion systems and environment maps it is preferable to detect objects, their orientation and dimensions directly in 3D coordinates. To address this 3D object detection task, we propose a direct 3D bounding box estimator which is realized as CNN decoder module and can be connected to most 2D object detectors like SSD[1], OverFeat[2], YOLO[3] and RetinaNet[4] or directly to CNN feature extractors like VGG [2] and ResNet [5]. The 3D parameters of the objects such as dimension and orientation are directly predicted by the CNN module. To successfully train this complex MultiNet architecture, a combination and modification of current loss functions is proposed. The fastest of the proposed network module combinations is capable of detecting objects in 3D camera coordinates at a frame rate of 28 fps.

[1]  Silvio Savarese,et al.  A coarse-to-fine model for 3D pose estimation and sub-category recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[3]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Luis Miguel Bergasa,et al.  Train Here, Deploy There: Robust Segmentation in Unseen Domains , 2018, 2018 IEEE Intelligent Vehicles Symposium (IV).

[5]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Jana Kosecka,et al.  Fast Single Shot Detection and Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[7]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[9]  Johann Marius Zöllner,et al.  Automated Focal Loss for Image based Object Detection , 2019, 2020 IEEE Intelligent Vehicles Symposium (IV).

[10]  Thierry Chateau,et al.  Deep MANTA: A Coarse-to-Fine Many-Task Network for Joint 2D and 3D Vehicle Analysis from Monocular Image , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[13]  Jana Kosecka,et al.  3D Bounding Box Estimation Using Deep Learning and Geometry , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[15]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[17]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[18]  Roberto Cipolla,et al.  Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Joachim Denzler,et al.  MB-Net: MergeBoxes for Real-Time 3D Vehicles Detection , 2018, 2018 IEEE Intelligent Vehicles Symposium (IV).

[20]  Luc Van Gool,et al.  Efficient Non-Maximum Suppression , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[21]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[22]  Roberto Cipolla,et al.  MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving , 2016, 2018 IEEE Intelligent Vehicles Symposium (IV).

[23]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Fernando A. Mujica,et al.  An Empirical Evaluation of Deep Learning on Highway Driving , 2015, ArXiv.

[25]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Torsten Bertram,et al.  A Fast Multi-Task CNN for Spatial Understanding of Traffic Scenes , 2018, 2018 21st International Conference on Intelligent Transportation Systems (ITSC).

[27]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[28]  Johann Marius Zöllner,et al.  Learning temporal features with CNNs for monocular visual ego motion estimation , 2017, 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC).

[29]  Adam Herout,et al.  BoxCars: Improving Fine-Grained Recognition of Vehicles Using 3-D Bounding Boxes in Traffic Surveillance , 2017, IEEE Transactions on Intelligent Transportation Systems.

[30]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[31]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[32]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[33]  Xiaogang Wang,et al.  PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.