CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers

Bird’s eye view (BEV) semantic segmentation plays a crucial role in spatial sensing for autonomous driving. Although recent work has made significant progress on BEV map understanding, existing methods all rely on single-agent camera systems, which struggle with occlusions and with detecting distant objects in complex traffic scenes. Vehicle-to-Vehicle (V2V) communication technologies allow autonomous vehicles to share sensing information, which can dramatically improve perception performance and range compared to single-agent systems. In this paper, we propose CoBEVT, the first generic multi-agent multi-camera perception framework that can cooperatively generate BEV map predictions. To efficiently fuse camera features from multi-view and multi-agent data in an underlying Transformer architecture, we design a fused axial attention (FAX) module, which sparsely captures local and global spatial interactions across views and agents. Extensive experiments on the V2V perception dataset OPV2V demonstrate that CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation. Moreover, CoBEVT is shown to generalize to other tasks, including 1) BEV segmentation with single-agent multi-camera systems and 2) 3D object detection with multi-agent LiDAR systems, where it achieves state-of-the-art performance with real-time inference speed.
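The abstract does not spell out how the FAX module groups tokens, so the following is only an illustrative sketch of one way "sparse local and global" attention across agents could be arranged. It assumes a window/grid partitioning in the style of multi-axis attention: local attention groups spatially adjacent tokens from all agents, while global attention groups strided tokens that span the whole map. The function names (`window_partition`, `grid_partition`, `self_attention`), the tensor layout `(agents, H, W, C)`, and the use of identity projections are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def window_partition(x, p):
    """Group tokens into non-overlapping p x p windows shared by all agents.

    x: (N, H, W, C) BEV features from N agents (assumed layout).
    Returns (num_windows, N*p*p, C); each group holds spatially adjacent tokens.
    """
    n, h, w, c = x.shape
    x = x.reshape(n, h // p, p, w // p, p, c)
    x = x.transpose(1, 3, 0, 2, 4, 5)      # (H/p, W/p, N, p, p, C)
    return x.reshape(-1, n * p * p, c)

def grid_partition(x, g):
    """Group tokens into sparse g x g grids shared by all agents.

    Each group holds tokens sampled with stride H/g and W/g, so a small
    group still covers the full spatial extent of the map.
    """
    n, h, w, c = x.shape
    x = x.reshape(n, g, h // g, g, w // g, c)
    x = x.transpose(2, 4, 0, 1, 3, 5)      # (H/g, W/g, N, g, g, C)
    return x.reshape(-1, n * g * g, c)

def self_attention(tokens):
    """Softmax self-attention inside each group (identity Q/K/V projections)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

# Toy input: 2 agents, an 8 x 8 BEV map, 16 channels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 8, 8, 16))

local_out = self_attention(window_partition(feats, p=4))   # local interactions
global_out = self_attention(grid_partition(feats, g=4))    # sparse global interactions
```

With this grouping, each attention call operates on `N * 4 * 4` tokens rather than all `N * H * W` tokens, which is the sense in which the attention is sparse in this sketch.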
