Paper Information - MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection

MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection

Monocular 3D object detection has long been a challenging task in autonomous driving, as it requires decoding 3D predictions solely from a single 2D image. Most existing methods follow conventional 2D object detectors: they first localize objects by their centers and then predict 3D attributes from local features neighboring each center. However, such a center-based pipeline treats 3D prediction as a subordinate task and lacks inter-object depth interactions with global spatial clues. In this paper, we introduce a simple framework for Monocular DEtection with a depth-aware TRansformer, named MonoDETR. We enable the vanilla transformer to be depth-aware and make the whole detection process guided by depth. Specifically, we represent 3D object candidates as a set of queries and produce non-local depth embeddings of the input image with a lightweight depth predictor and an attention-based depth encoder. Then, we propose a depth-aware decoder that conducts both inter-query and query-scene depth feature communication. In this way, each object estimates its 3D attributes adaptively from the depth-informative regions of the image and is not limited to features around its center. With minimal handcrafted designs, MonoDETR is an end-to-end framework that needs no additional data, anchors, or NMS, and it achieves competitive performance on the KITTI benchmark among state-of-the-art center-based networks. Extensive ablation studies demonstrate the effectiveness of our approach and its potential to serve as a transformer baseline for future monocular research. Code is available at https://github.com/ZrrSkywalker/MonoDETR.git.
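The two kinds of communication described in the abstract (query-scene depth attention and inter-query attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head attention without learned projections, the ordering of the two attention steps, and the toy shapes are all assumptions made for illustration, and the depth predictor and depth encoder are replaced by random embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Single-head scaled dot-product attention; returns (output, weights)."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values, weights


def depth_aware_decoder_layer(object_queries, depth_embeddings):
    """One simplified decoder layer (hypothetical; no learned projections, norms, or FFN).

    object_queries:   (num_queries, dim), one row per 3D object candidate.
    depth_embeddings: (num_pixels, dim), flattened non-local depth embeddings.
    """
    # Query-scene communication: every query attends to the whole depth map,
    # so it is not restricted to features around an object center.
    depth_out, depth_weights = attention(
        object_queries, depth_embeddings, depth_embeddings
    )
    queries = object_queries + depth_out  # residual connection

    # Inter-query communication: object candidates exchange information.
    self_out, _ = attention(queries, queries, queries)
    queries = queries + self_out
    return queries, depth_weights


# Toy inputs standing in for learned queries and encoded depth features (assumed sizes).
rng = np.random.default_rng(0)
num_queries, height, width, dim = 4, 6, 8, 16
object_queries = rng.normal(size=(num_queries, dim))
depth_embeddings = rng.normal(size=(height * width, dim))

updated, depth_weights = depth_aware_decoder_layer(object_queries, depth_embeddings)
print(updated.shape)        # (4, 16)
print(depth_weights.shape)  # (4, 48)
```

Each row of `depth_weights` is a distribution over all image positions, which is the sense in which a query can draw on depth-informative regions anywhere in the image. A full model would add learned projections, multiple heads, normalization, feed-forward layers, and prediction heads for the 3D attributes.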
