Paper information - 2D-3D scene understanding for autonomous driving

In this thesis, we address two challenges: label scarcity and the fusion of heterogeneous 3D point clouds and 2D images. We adopt the strategy of end-to-end race driving, where a neural network is trained to directly map sensor input (camera images) to control output, which makes this strategy independent of annotations in the visual domain. We employ deep reinforcement learning, where the algorithm learns from rewards by interacting with a realistic simulator, and we propose new training strategies and reward functions for better driving and faster convergence. However, training time remains very long, which is why the remainder of this thesis focuses on perception and studies point cloud and image fusion. We propose two different methods for 2D-3D fusion. First, we project 3D LiDAR point clouds into 2D image space, resulting in sparse depth maps. We propose a novel encoder-decoder architecture that fuses dense RGB and sparse depth for the task of depth completion, which enhances point cloud resolution to image level. Second, we fuse directly in 3D space to prevent information loss through projection. To this end, we compute image features from multiple views with a 2D CNN and lift them all to a global 3D point cloud for fusion, followed by a point-based network that predicts 3D semantic labels. Building on this work, we introduce the novel and more difficult task of cross-modal unsupervised domain adaptation, where one is provided with multi-modal data in a labeled source dataset and an unlabeled target dataset. We propose to perform 2D-3D cross-modal learning via mutual mimicking between image and point cloud networks to address the source-target domain shift. We further show that our method is complementary to the existing uni-modal technique of pseudo-labeling.
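
The projection step mentioned in the abstract (3D LiDAR points to a sparse 2D depth map) can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes a simple pinhole camera, points already expressed in the camera frame with z pointing forward, and a nearest-point rule when several points fall on the same pixel. The function name `project_to_sparse_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical names chosen for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
DepthMap = List[List[Optional[float]]]


def project_to_sparse_depth(
    points_cam: Sequence[Point],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> DepthMap:
    """Project 3D points (camera frame, z forward) into a sparse depth map.

    Assumption: ideal pinhole model without lens distortion.
    Pixels that receive no point stay None, which is what makes the map sparse.
    """
    depth: DepthMap = [[None] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0.0:
            # Points behind the camera cannot be seen in the image.
            continue
        # Perspective division followed by the intrinsic scaling and offset.
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            current = depth[v][u]
            # Keep the closest point when several land on the same pixel.
            if current is None or z < current:
                depth[v][u] = z
    return depth


if __name__ == "__main__":
    points = [
        (0.0, 0.0, 5.0),   # lands on the image centre
        (0.0, 0.0, 3.0),   # same pixel, closer, so it replaces the first
        (1.0, 0.0, 2.0),   # lands one pixel to the right of the centre
        (0.0, 0.0, -1.0),  # behind the camera, skipped
        (10.0, 0.0, 1.0),  # outside the image, skipped
    ]
    depth_map = project_to_sparse_depth(points, 2.0, 2.0, 2.0, 2.0, 4, 4)
    for row in depth_map:
        print(row)
```

Most pixels in the result remain empty, which is why a depth completion network that also sees the dense RGB image is needed to reach image-level resolution.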

Maximilian Jaritz | M. Jaritz
