6D-VNet: End-To-End 6DoF Vehicle Pose Estimation From Monocular RGB Images

We present a conceptually simple framework for 6DoF object pose estimation, especially for autonomous driving scenario. Our approach efficiently detects traffic participants in a monocular RGB image while simultaneously regressing their 3D translation and rotation vectors. The method, called 6D-VNet, extends Mask R-CNN by adding customised heads for predicting vehicle's finer class, rotation and translation. The proposed 6D-VNet is trained end-to-end compared to previous methods. Furthermore, we show that the inclusion of translational regression in the joint losses is crucial for the 6DoF pose estimation task, where object translation distance along longitudinal axis varies significantly, e.g., in autonomous driving scenarios. Additionally, we incorporate the mutual information between traffic participants via a modified non-local block. As opposed to the original non-local block implementation, the proposed weighting modification takes the spatial neighbouring information into consideration whilst counteracting the effect of extreme gradient values. Our 6D-VNet reaches the 1 st place in ApolloScape challenge 3D Car Instance task. Code has been made available at: https://github.com/stevenwudi/6DVNET .

[1]  Jean-Michel Morel,et al.  A non-local algorithm for image denoising , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[2]  Kaiming He,et al.  Group Normalization , 2018, ECCV.

[3]  Trevor Darrell,et al.  Learning to Segment Every Thing , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Zoltan-Csaba Marton,et al.  Implicit 3D Orientation Learning for 6D Object Detection from RGB Images , 2018, ECCV.

[5]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[6]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[7]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[8]  Eric Brachmann,et al.  Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Pascal Fua,et al.  Real-Time Seamless Single Shot 6D Object Pose Prediction , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Manolis I. A. Lourakis,et al.  Detection and fine 3D pose estimation of texture-less objects in RGB-D images , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[11]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[12]  Ruigang Yang,et al.  DeLS-3D: Deep Localization and Segmentation with a 3D Semantic Map , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[14]  Jean Oh,et al.  Social Attention: Modeling Attention in Human Crowds , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[15]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Nassir Navab,et al.  SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17]  Roberto Cipolla,et al.  PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[18]  Wojciech Zaremba,et al.  Domain randomization for transferring deep neural networks from simulation to the real world , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[19]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[21]  Vincent Lepetit,et al.  Learning descriptors for object recognition and 3D pose estimation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Pietro Perona,et al.  Pedestrian Detection: An Evaluation of the State of the Art , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Eric Brachmann,et al.  Learning 6D Object Pose Estimation Using 3D Object Coordinates , 2014, ECCV.

[24]  Eric Brachmann,et al.  BOP: Benchmark for 6D Object Pose Estimation , 2018, ECCV.

[25]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[26]  Vladlen Koltun,et al.  Open3D: A Modern Library for 3D Data Processing , 2018, ArXiv.

[27]  Ian D. Reid,et al.  LieNet: Real-time Monocular Object Instance 6D Pose Estimation , 2018, BMVC.

[28]  Roberto Cipolla,et al.  Geometric Loss Functions for Camera Pose Regression with Deep Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Ruigang Yang,et al.  The ApolloScape Dataset for Autonomous Driving , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[30]  Chyi-Yeu Lin,et al.  6D pose estimation using an improved method based on point pair features , 2018, 2018 4th International Conference on Control, Automation and Robotics (ICCAR).

[31]  Tae-Kyun Kim,et al.  Latent-Class Hough Forests for 3D Object Detection and Pose Estimation , 2014, ECCV.

[32]  Silvio Savarese,et al.  Social LSTM: Human Trajectory Prediction in Crowded Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Dirk Kraft,et al.  Rotational Subgroup Voting and Pose Clustering for Robust 3D Object Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Kaiming He,et al.  Data Distillation: Towards Omni-Supervised Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Manolis I. A. Lourakis,et al.  T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-Less Objects , 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[36]  Yi Li,et al.  DeepIM: Deep Iterative Matching for 6D Pose Estimation , 2018, International Journal of Computer Vision.

[37]  Kaiming He,et al.  Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  Nassir Navab,et al.  Deep Learning of Local RGB-D Patches for 3D Object Detection and 6D Pose Estimation , 2016, ECCV.

[39]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[40]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[43]  Vincent Lepetit,et al.  Going Further with Point Pair Features , 2016, ECCV.

[44]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).