Structured Bird’s-Eye-View Traffic Scene Understanding from Onboard Images

Autonomous navigation requires a structured representation of the road network and instance-wise identification of the other traffic agents. Since the traffic scene is defined on the ground plane, this corresponds to scene understanding in the bird’s-eye-view (BEV). However, the onboard cameras of autonomous cars are customarily mounted horizontally for a better view of the surroundings, which makes this task very challenging. In this work, we study the problem of extracting a directed graph representing the local road network in BEV coordinates from a single onboard camera image. Moreover, we show that the method can be extended to detect dynamic objects on the BEV plane. The semantics, locations, and orientations of the detected objects, together with the road graph, facilitate a comprehensive understanding of the scene. Such understanding is fundamental for downstream tasks such as path planning and navigation. We validate our approach against powerful baselines and show that our network achieves superior performance. We also demonstrate the effects of various design choices through ablation studies. Code: https://github.com/ybarancan/STSU
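
To make the target output concrete, the following is a minimal sketch of what a directed road graph in BEV coordinates could look like as a data structure. It is an illustration only, not the representation used in the released code: the class name `RoadGraph`, its methods, the choice of polylines for lane segments, and the ego-centred metric frame are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RoadGraph:
    """Hypothetical directed graph of lane segments in BEV coordinates.

    Assumption: each node is a lane segment stored as a polyline of
    (x, y) points in metres, in an ego-centred ground-plane frame.
    A directed edge src -> dst means traffic can flow from src into dst.
    """

    segments: dict[int, list[tuple[float, float]]] = field(default_factory=dict)
    successors: dict[int, set[int]] = field(default_factory=dict)

    def add_segment(self, seg_id: int, points: list[tuple[float, float]]) -> None:
        """Register a lane segment and give it an empty successor set."""
        self.segments[seg_id] = list(points)
        self.successors.setdefault(seg_id, set())

    def connect(self, src: int, dst: int) -> None:
        """Add a directed edge from segment src to segment dst."""
        if src not in self.segments or dst not in self.segments:
            raise KeyError("both segments must exist before connecting them")
        self.successors[src].add(dst)

    def reachable(self, start: int) -> set[int]:
        """Return all segments reachable from start by following edge directions."""
        seen: set[int] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in self.successors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


def build_example() -> RoadGraph:
    """Toy scene: one approach lane that splits into a right and a left branch."""
    graph = RoadGraph()
    graph.add_segment(0, [(0.0, 0.0), (0.0, 10.0)])    # approach lane
    graph.add_segment(1, [(0.0, 10.0), (5.0, 15.0)])   # right branch
    graph.add_segment(2, [(0.0, 10.0), (-5.0, 15.0)])  # left branch
    graph.connect(0, 1)
    graph.connect(0, 2)
    return graph
```

Because the edges are directed, the approach lane reaches both branches while neither branch reaches the other, which is the kind of connectivity information a path planner would query.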
