MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization

Detecting and localizing objects in real 3D space plays a crucial role in scene understanding, but it is particularly challenging given only a single RGB image because geometric information is lost during image projection. We propose MonoGRNet for amodal 3D object detection from a monocular RGB image via geometric reasoning in both the observed 2D projection and the unobserved depth dimension. MonoGRNet is a single, unified network composed of four task-specific subnetworks, responsible for 2D object detection, instance depth estimation (IDE), 3D localization, and local corner regression. In contrast to pixel-level depth estimation, which needs per-pixel annotations, our novel IDE method directly predicts the depth of the target 3D bounding box's center using sparse supervision. 3D localization is then achieved by estimating the position in the horizontal and vertical dimensions. Finally, MonoGRNet is jointly learned by optimizing the locations and poses of the 3D bounding boxes in the global context. We demonstrate that MonoGRNet achieves state-of-the-art performance on challenging datasets.
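
To illustrate how an instance depth and a 2D position can together fix a 3D location, the following is a minimal sketch based on the standard pinhole camera model. It is not the network described above. It assumes that the projected 3D center (u, v) in pixels and the center depth are already available, and the function names and the intrinsics (fx, fy, cx, cy) are hypothetical values chosen for the example.

```python
# Minimal pinhole-camera sketch (an illustration, not the MonoGRNet implementation).
# Assumptions: (u, v) is the projected 3D box center in pixels, depth is the
# estimated distance along the camera's optical axis, and fx, fy, cx, cy are
# hypothetical camera intrinsics.

def backproject_center(u, v, depth, fx, fy, cx, cy):
    """Recover the camera-frame (x, y, z) of a point from its pixel and depth."""
    x = (u - cx) * depth / fx  # horizontal position
    y = (v - cy) * depth / fy  # vertical position
    return (x, y, depth)


def project_point(x, y, z, fx, fy, cx, cy):
    """Project a camera-frame 3D point back to pixel coordinates."""
    return (fx * x / z + cx, fy * y / z + cy)


if __name__ == "__main__":
    # Illustrative intrinsics only; real values come from camera calibration.
    fx, fy, cx, cy = 700.0, 700.0, 600.0, 180.0
    center_3d = backproject_center(670.0, 215.0, 20.0, fx, fy, cx, cy)
    print(center_3d)
    print(project_point(*center_3d, fx, fy, cx, cy))
```

With the depth fixed, the horizontal and vertical coordinates follow from a single pixel location, which is why an accurate depth for the box center carries most of the localization burden in this simplified view.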
