VectorMapNet: End-to-end Vectorized HD Map Learning

Autonomous driving systems require a good understanding of surrounding environments, including moving obstacles and static High-Definition (HD) semantic map elements. Existing methods approach the semantic map problem by offline manual annotation, which suffers from serious scalability issues. Recent learning-based methods produce dense rasterized segmentation predictions to construct maps. However, these predictions do not include instance information of individual map elements and require heuristic post-processing to obtain vectorized maps. To tackle these challenges, we introduce an end-to-end vectorized HD map learning pipeline, termed VectorMapNet. VectorMapNet takes onboard sensor observations and predicts a sparse set of polylines in the bird's-eye view. This pipeline can explicitly model the spatial relation between map elements and generate vectorized maps that are friendly to downstream autonomous driving tasks. Extensive experiments show that VectorMapNet achieve strong map learning performance on both nuScenes and Argoverse2 dataset, surpassing previous state-of-the-art methods by 14.2 mAP and 14.6mAP. Qualitatively, we also show that VectorMapNet is capable of generating comprehensive maps and capturing more fine-grained details of road geometry. To the best of our knowledge, VectorMapNet is the first work designed towards end-to-end vectorized map learning from onboard observations. Our project website is available at https://tsinghua-mars-lab.github.io/vectormapnet/.

[1]  Philipp Krahenbuhl,et al.  Cross-view Transformers for real-time Map-view Semantic Segmentation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Yilun Wang,et al.  FUTR3D: A Unified Sensor Fusion Framework for 3D Detection , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[3]  Xin Tan,et al.  Rethinking Efficient Lane Detection via Curve Modeling , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  L. Ni,et al.  DN-DETR: Accelerate DETR Training by Introducing Query DeNoising , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  James Hays,et al.  Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting , 2023, NeurIPS Datasets and Benchmarks.

[6]  Yilun Wang,et al.  DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries , 2021, CoRL.

[7]  Luc Van Gool,et al.  Structured Bird’s-Eye-View Traffic Scene Understanding from Onboard Images , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Shengfeng He,et al.  Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Cordelia Schmid,et al.  HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Wolfram Burgard,et al.  Lane Graph Estimation for Scene Understanding in Urban Driving , 2021, IEEE Robotics and Automation Letters.

[11]  Bolei Zhou,et al.  Multimodal Motion Prediction with Stacked Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Raquel Urtasun,et al.  MP3: A Unified Model to Map, Perceive, Predict and Plan , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Z. Tu,et al.  Line Segment Detection Using Transformers without Edges , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Bin Li,et al.  Deformable DETR: Deformable Transformers for End-to-End Object Detection , 2020, ICLR.

[15]  Yilun Wang,et al.  HDMapNet: A Local Semantic Map Learning and Evaluation Framework , 2021 .

[16]  Luc Van Gool,et al.  Understanding Bird's-Eye View Semantic HD-Maps Using an Onboard Monocular Camera , 2020, ArXiv.

[17]  Yi Shen,et al.  TNT: Target-driveN Trajectory Prediction , 2020, CoRL.

[18]  Sergio Casas,et al.  Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations , 2020, ECCV.

[19]  Sanja Fidler,et al.  Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D , 2020, ECCV.

[20]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[21]  Dragomir Anguelov,et al.  VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Q. Lu,et al.  LGSVL Simulator: A High Fidelity Simulator for Autonomous Driving , 2020, 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC).

[23]  Roberto Cipolla,et al.  Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  S. M. Ali Eslami,et al.  PolyGen: An Autoregressive Generative Model of 3D Meshes , 2020, ICML.

[25]  Jian Yang,et al.  Line-CNN: End-to-End Traffic Line Detection With Line Proposal Unit , 2020, IEEE Transactions on Intelligent Transportation Systems.

[26]  Bolei Zhou,et al.  Cross-View Semantic Segmentation for Sensing Surroundings , 2019, IEEE Robotics and Automation Letters.

[27]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Yin Zhou,et al.  End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds , 2019, CoRL.

[29]  Leonidas J. Guibas,et al.  StructureNet , 2019, ACM Trans. Graph..

[30]  Simon Lucey,et al.  Argoverse: 3D Tracking and Forecasting With Rich Maps , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Raquel Urtasun,et al.  Convolutional Recurrent Network for Road Boundary Extraction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Ilya Sutskever,et al.  Generating Long Sequences with Sparse Transformers , 2019, ArXiv.

[33]  Silvio Savarese,et al.  Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Luc Van Gool,et al.  End-to-end Lane Detection through Differentiable Least-Squares Fitting , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[35]  Jiong Yang,et al.  PointPillars: Fast Encoders for Object Detection From Point Clouds , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Chenyang Lu,et al.  Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder–Decoder Networks , 2018, IEEE Robotics and Automation Letters.

[37]  Bin Yang,et al.  HDNET: Exploiting HD Maps for 3D Object Detection , 2018, CoRL.

[38]  Raquel Urtasun,et al.  Hierarchical Recurrent Attention Networks for Structured Online Maps , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Dustin Tran,et al.  Image Transformer , 2018, ICML.

[40]  Luc Van Gool,et al.  Towards End-to-End Lane Detection: an Instance Segmentation Approach , 2018, 2018 IEEE Intelligent Vehicles Symposium (IV).

[41]  Xiaogang Wang,et al.  Spatial As Deep: Spatial CNN for Traffic Scene Understanding , 2017, AAAI.

[42]  Douglas Eck,et al.  A Neural Representation of Sketch Drawings , 2017, ICLR.

[43]  Frank Hutter,et al.  Fixing Weight Decay Regularization in Adam , 2017, ArXiv.

[44]  In So Kweon,et al.  VPGNet: Vanishing Point Guided Network for Lane and Road Marking Detection and Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[45]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[46]  Leonidas J. Guibas,et al.  GRASS: Generative Recursive Autoencoders for Shape Structures , 2017, ACM Trans. Graph..

[47]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Min Bai,et al.  TorontoCity: Seeing the World with a Million Eyes , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[49]  Sanja Fidler,et al.  HD Maps: Fine-Grained Road Segmentation by Parsing Ground and Aerial Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Sanja Fidler,et al.  Enhancing Road Maps by Parsing Aerial Images Around the World , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[52]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[53]  Sanja Fidler,et al.  Holistic 3D scene understanding from a single geo-tagged image , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Haim Kaplan,et al.  Computing the Discrete Fréchet Distance in Subquadratic Time , 2012, SIAM J. Comput..

[55]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[56]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[57]  J. Little,et al.  Inverse perspective mapping simplifies optical flow computation and obstacle detection , 2004, Biological Cybernetics.

[58]  H. Mannila,et al.  Computing Discrete Fréchet Distance ∗ , 1994 .