Monocular 3D Object Detection Using Feature Map Transformation: Towards Learning Perspective-Invariant Scene Representations

In this paper, we propose a feature map transformation network for monocular 3D object detection. Given a monocular camera image, the transformation network encodes features of the scene in an abstract, perspective-invariant latent representation. This latent representation can then be decoded into a bird's-eye view representation from which the position and rotation of objects in 3D space are estimated. In our experiments on the KITTI object detection dataset, we show that our model learns to estimate objects' 3D positions from a monocular camera image alone, without any explicit geometric model or other prior information on how to perform the transformation. Although it performs slightly worse than networks purpose-built for this task, our approach allows the same bird's-eye view object detection network to be fed with input data from different sensor modalities, which can increase redundancy in a safety-critical environment. We present additional experiments to gain insight into the properties of the learned perspective-invariant abstract scene representation.
