PolarFormer: Multi-camera 3D Object Detection with Polar Transformers

3D object detection in autonomous driving aims to reason about "what" and "where" the objects of interest are in a 3D world. Following the conventional wisdom of 2D object detection, existing methods often adopt the canonical Cartesian coordinate system with perpendicular axes. However, we conjecture that this does not fit the nature of the ego car's perspective, as each onboard camera perceives the world in a wedge shape intrinsic to the imaging geometry, with radial (non-perpendicular) axes. Hence, in this paper we advocate the exploitation of the Polar coordinate system and propose a new Polar Transformer (PolarFormer) for more accurate 3D object detection in the bird's-eye view (BEV), taking only multi-camera 2D images as input. Specifically, we design a cross-attention based Polar detection head that imposes no restriction on the shape of the input structure, so as to handle irregular Polar grids. To tackle the unconstrained object scale variations along the Polar distance dimension, we further introduce a multi-scale Polar representation learning strategy. As a result, our model can make the best use of the Polar representation, rasterized by attending to the corresponding image observations in a sequence-to-sequence fashion subject to the geometric constraints. Thorough experiments on the nuScenes dataset demonstrate that our PolarFormer significantly outperforms state-of-the-art 3D object detection alternatives.
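To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of how a Polar BEV representation can be built with cross-attention: each azimuth bin of the Polar grid is treated as a ray that attends to the image column sharing its viewing direction. The module name, tensor shapes, and hyper-parameters are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: Polar-ray cross-attention for building a Polar BEV feature map.
# Assumption: each azimuth bin corresponds to one image column, which holds
# approximately for a rectified pinhole camera.
import torch
import torch.nn as nn


class PolarRayCrossAttention(nn.Module):
    """Cross-attend learnable radial-bin queries to per-column image features."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_radial_bins: int = 64):
        super().__init__()
        # Learnable queries for the radial bins of a single Polar ray.
        self.ray_queries = nn.Parameter(torch.randn(num_radial_bins, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        """img_feats: (B, C, H, W) features from a 2D image backbone.

        Returns a Polar BEV map of shape (B, C, num_radial_bins, W),
        where W plays the role of the azimuth dimension.
        """
        B, C, H, W = img_feats.shape
        # Treat each image column as the memory for one Polar ray:
        # (B, C, H, W) -> (B * W, H, C)
        cols = img_feats.permute(0, 3, 2, 1).reshape(B * W, H, C)
        queries = self.ray_queries.unsqueeze(0).expand(B * W, -1, -1)
        # Each radial-bin query attends over all pixels of its column.
        ray_feats, _ = self.attn(queries, cols, cols)
        ray_feats = self.norm(ray_feats)
        # (B * W, R, C) -> (B, C, R, W): a Polar (radius x azimuth) BEV map.
        R = ray_feats.shape[1]
        return ray_feats.reshape(B, W, R, C).permute(0, 3, 2, 1)


if __name__ == "__main__":
    feats = torch.randn(2, 256, 32, 88)          # toy single-camera backbone output
    polar_bev = PolarRayCrossAttention()(feats)  # (2, 256, 64, 88)
    print(polar_bev.shape)
```

In a multi-camera setup, such per-camera Polar maps would additionally need to be fused into a single ego-centric BEV using the camera extrinsics; that step is omitted here.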
