StreamMapNet: Streaming Mapping Network for Vectorized Online HD Map Construction

High-Definition (HD) maps are essential for the safety of autonomous driving systems. While existing techniques employ camera images and onboard sensors to generate vectorized high-precision maps, they are constrained by their reliance on single-frame input. This approach limits their stability and performance in complex scenarios such as occlusions, largely due to the absence of temporal information. Moreover, their performance diminishes when applied to broader perception ranges. In this paper, we present StreamMapNet, a novel online mapping pipeline adept at long-sequence temporal modeling of videos. StreamMapNet employs multi-point attention and temporal information which empowers the construction of large-range local HD maps with high stability and further addresses the limitations of existing methods. Furthermore, we critically examine widely used online HD Map construction benchmark and datasets, Argoverse2 and nuScenes, revealing significant bias in the existing evaluation protocols. We propose to resplit the benchmarks according to geographical spans, promoting fair and precise evaluations. Experimental results validate that StreamMapNet significantly outperforms existing methods across all settings while maintaining an online inference speed of $14.2$ FPS. Our code is available at https://github.com/yuantianyuan01/StreamMapNet.

[1]  Xi Qiu,et al.  End-to-End Vectorized HD-map Construction with Piecewise Bézier Curve , 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Lichao Huang,et al.  Sparse4D v2: Recurrent Temporal Fusion with Sparse Model , 2023, ArXiv.

[3]  Yilun Wang,et al.  Neural Map Prior for Autonomous Driving , 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Tiancai Wang,et al.  Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection , 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).

[5]  Jinrong Yang,et al.  Exploring Recurrent Long-term Temporal Fusion for Multi-view 3D Perception , 2023, ArXiv.

[6]  James Hays,et al.  Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting , 2023, NeurIPS Datasets and Benchmarks.

[7]  Jifeng Dai,et al.  BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jinhyung D. Park,et al.  Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection , 2022, ICLR.

[9]  Chang Huang,et al.  MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction , 2022, ICLR.

[10]  Yilun Wang,et al.  VectorMapNet: End-to-end Vectorized HD Map Learning , 2022, ICML.

[11]  Jifeng Dai,et al.  BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers , 2022, ECCV.

[12]  Junjie Huang,et al.  BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection , 2022, ArXiv.

[13]  Bin Li,et al.  Deformable DETR: Deformable Transformers for End-to-End Object Detection , 2020, ICLR.

[14]  Sanja Fidler,et al.  Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D , 2020, ECCV.

[15]  R. Urtasun,et al.  Learning Lane Graph Representations for Motion Forecasting , 2020, ECCV.

[16]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[17]  Dragomir Anguelov,et al.  VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Roberto Cipolla,et al.  Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Brendan Englot,et al.  LeGO-LOAM: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain , 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[21]  Frank Hutter,et al.  Fixing Weight Decay Regularization in Adam , 2017, ArXiv.

[22]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[23]  Serge J. Belongie,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[26]  Ji Zhang,et al.  LOAM: Lidar Odometry and Mapping in Real-time , 2014, Robotics: Science and Systems.

[27]  Adam W. Harley,et al.  A Simple Baseline for BEV Perception Without LiDAR , 2022, ArXiv.

[28]  Yilun Wang,et al.  HDMapNet: A Local Semantic Map Learning and Evaluation Framework , 2021 .

[29]  J. Little,et al.  Inverse perspective mapping simplifies optical flow computation and obstacle detection , 2004, Biological Cybernetics.