Temporal Consistent 3D LiDAR Representation Learning for Semantic Perception in Autonomous Driving

Semantic perception is a core building block in autonomous driving, since it provides information about the drivable space and the location of other traffic participants. For learning-based perception, a large amount of diverse training data is often necessary to achieve high performance. Data labeling is usually a bottleneck for developing such methods, especially for dense prediction tasks, e.g., semantic segmentation or panoptic segmentation. For 3D LiDAR data, the annotation process demands even more effort than for images. Especially in autonomous driving, point clouds are sparse, and an object's appearance depends on its distance from the sensor, making it harder to acquire large amounts of labeled training data. This paper takes an alternative path by proposing a self-supervised representation learning method for 3D LiDAR data. Our approach exploits the vehicle motion to match objects viewed in different scans across time. We then train a model to maximize the point-wise feature similarities between points of the associated object in different scans, which enables learning a representation that is consistent across time. The experimental results show that our approach performs better than previous state-of-the-art self-supervised representation learning methods when fine-tuned on different downstream tasks. We furthermore show that with only 10% of labeled data, a network pre-trained with our approach can achieve better performance than the same network trained from scratch with all labels for semantic segmentation on SemanticKITTI. Code: https://github.com/PRBonn/TARL
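The training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes that points in two scans have already been associated to the same object through shared segment ids, and it uses cosine similarity to the mean feature of the matching segment as the similarity measure. The function names `l2_normalize` and `temporal_similarity_loss`, the array layout, and the choice of the segment mean as the target are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def temporal_similarity_loss(feats_t, seg_ids_t, feats_tn, seg_ids_tn):
    """Hypothetical loss: mean of (1 - cosine similarity) between every
    point feature in scan t and the mean feature of the same segment in
    scan t+n. Lower values mean more temporally consistent features.

    feats_*   : (N, D) float arrays of point-wise features
    seg_ids_* : (N,) int arrays; equal ids mark the same object across scans
    """
    f_t = l2_normalize(feats_t)
    f_tn = l2_normalize(feats_tn)

    losses = []
    # Only segments that appear in both scans can be compared.
    for seg in np.intersect1d(seg_ids_t, seg_ids_tn):
        # Target: normalized mean feature of the segment in the other scan.
        target = f_tn[seg_ids_tn == seg].mean(axis=0)
        target = target / (np.linalg.norm(target) + 1e-8)
        # Cosine similarity of each point in scan t to that target.
        sims = f_t[seg_ids_t == seg] @ target
        losses.append(1.0 - sims)

    if not losses:
        return 0.0
    return float(np.concatenate(losses).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_a = rng.normal(size=(6, 4))
    feats_b = rng.normal(size=(5, 4))
    ids_a = np.array([0, 0, 1, 1, 2, 2])
    ids_b = np.array([0, 1, 1, 2, 3])
    print(temporal_similarity_loss(feats_a, ids_a, feats_b, ids_b))
```

In a real pipeline the features would come from a trainable 3D backbone and the loss would be minimized by gradient descent in a deep learning framework; NumPy is used here only to keep the sketch self-contained.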
