Visual SLAM for Automated Driving: Exploring the Applications of Deep Learning

Deep learning has become the standard model for object detection and recognition. Recently, there is progress on using CNN models for geometric vision tasks like depth estimation, optical flow prediction or motion segmentation. However, Visual SLAM remains to be one of the areas of automated driving where CNNs are not mature for deployment in commercial automated driving systems. In this paper, we explore how deep learning can be used to replace parts of the classical Visual SLAM pipeline. Firstly, we describe the building blocks of Visual SLAM pipeline composed of standard geometric vision tasks. Then we provide an overview of Visual SLAM use cases for automated driving based on the authors' experience in commercial deployment. Finally, we discuss the opportunities of using Deep Learning to improve upon state-of-the-art classical methods.

[1]  Guo-Jun Qi,et al.  Hierarchically Gated Deep Networks for Semantic Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Jitendra Malik,et al.  Learning to segment moving objects in videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Peter Mühlfellner,et al.  Evaluation of fisheye-camera based visual multi-session localization in a real-world scenario , 2013, 2013 IEEE Intelligent Vehicles Symposium Workshops (IV Workshops).

[4]  Silvio Savarese,et al.  Universal Correspondence Network , 2016, NIPS.

[5]  Roberto Cipolla,et al.  MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving , 2016, 2018 IEEE Intelligent Vehicles Symposium (IV).

[6]  Thomas Brox,et al.  Object Detection, Tracking, and Motion Segmentation for Object-level Video Segmentation , 2016, ArXiv.

[7]  P. Torr Geometric motion segmentation and model selection , 1998, Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences.

[8]  Patrick Pérez,et al.  The Semantic Paintbrush: Interactive 3D Mapping and Recognition in Large Outdoor Spaces , 2015, CHI.

[9]  한보형,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015 .

[10]  Xin Zhang,et al.  End to End Learning for Self-Driving Cars , 2016, ArXiv.

[11]  Michael Bosse,et al.  Summary Maps for Lifelong Visual Localization , 2016, J. Field Robotics.

[12]  Andreas Geiger,et al.  Object scene flow for autonomous vehicles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Cordelia Schmid,et al.  SfM-Net: Learning of Structure and Motion from Video , 2017, ArXiv.

[14]  Thomas Brox,et al.  3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation , 2016, MICCAI.

[15]  Kristen Grauman,et al.  FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Olivier Stasse,et al.  MonoSLAM: Real-Time Single Camera SLAM , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Andreas Geiger,et al.  Bounding Boxes, Segmentations and Object Coordinates: How Important is Recognition for 3D Scene Flow Estimation in Autonomous Driving Scenarios? , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Sen Wang,et al.  DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[19]  G. Klein,et al.  Parallel Tracking and Mapping for Small AR Workspaces , 2007, 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality.

[20]  J. M. M. Montiel,et al.  ORB-SLAM: A Versatile and Accurate Monocular SLAM System , 2015, IEEE Transactions on Robotics.

[21]  Yi Yang,et al.  Attention to Scale: Scale-Aware Semantic Image Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Thomas Brox,et al.  A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[24]  Martin Lauer,et al.  3D Traffic Scene Understanding From Movable Platforms , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Eric Brachmann,et al.  DSAC — Differentiable RANSAC for Camera Localization , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yang Gao,et al.  End-to-End Learning of Driving Models from Large-Scale Video Datasets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Patrick Pérez,et al.  Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction , 2015, 2015 IEEE International Conference on Robotics and Automation (ICRA).

[28]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Thomas Brox,et al.  DeMoN: Depth and Motion Network for Learning Monocular Stereo , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Federico Tombari,et al.  CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[33]  L. Bottou,et al.  Deep Convolutional Networks for Scene Parsing , 2009 .

[34]  Paul Newman,et al.  Practice makes perfect? Managing and leveraging visual experiences for lifelong navigation , 2012, 2012 IEEE International Conference on Robotics and Automation.

[35]  Jitendra Malik,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence Segmentation of Moving Objects by Long Term Video Analysis , 2022 .

[36]  Roberto Cipolla,et al.  Semantic object classes in video: A high-definition ground truth database , 2009, Pattern Recognit. Lett..

[37]  Xiaohui Xie,et al.  Adversarial Deep Structural Networks for Mammographic Mass Segmentation , 2016, bioRxiv.

[38]  Zhichao Yin,et al.  GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Weifeng Chen,et al.  Single-Image Depth Perception in the Wild , 2016, NIPS.

[40]  Roberto Cipolla,et al.  Convolutional networks for real-time 6-DOF camera relocalization , 2015, ArXiv.

[41]  Konrad Schindler,et al.  3D Scene Flow Estimation with a Piecewise Rigid Scene Model , 2015, International Journal of Computer Vision.

[42]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling , 2015, CVPR 2015.

[43]  Andreas Dengel,et al.  Multi-Task Learning for Segmentation of Building Footprints with Deep Neural Networks , 2017, 2019 IEEE International Conference on Image Processing (ICIP).

[44]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Luca Iocchi,et al.  Human-Robot Collaboration for Semantic Labeling of the Environment , 2013 .

[46]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[47]  Andreas Geiger,et al.  Understanding High-Level Semantics by Modeling Traffic Patterns , 2013, 2013 IEEE International Conference on Computer Vision.

[48]  Daniel Cremers,et al.  LSD-SLAM: Large-Scale Direct Monocular SLAM , 2014, ECCV.

[49]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Henning Lategahn,et al.  How to learn an illumination robust image feature for place recognition , 2013, 2013 IEEE Intelligent Vehicles Symposium (IV).

[51]  Andrew J. Davison,et al.  DTAM: Dense tracking and mapping in real-time , 2011, 2011 International Conference on Computer Vision.

[52]  Karteek Alahari,et al.  Learning Motion Patterns in Videos , 2016, CVPR.

[53]  Sergio Orts,et al.  HyperDepth: Learning Depth from Structured Light without Matching , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  James M. Rehg,et al.  Joint Semantic Segmentation and 3D Reconstruction from Monocular Video , 2014, ECCV.

[57]  Eric Brachmann,et al.  Learning Less is More - 6D Camera Localization via 3D Surface Regression , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Gustavo Carneiro,et al.  Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue , 2016, ECCV.

[59]  Tom Duckett,et al.  Dynamic Maps for Long-Term Operation of Mobile Service Robots , 2005, Robotics: Science and Systems.

[60]  Tom Duckett,et al.  A multilevel relaxation algorithm for simultaneous localization and mapping , 2005, IEEE Transactions on Robotics.

[61]  Wolfram Burgard,et al.  Deep Multispectral Semantic Scene Understanding of Forested Environments Using Multimodal Fusion , 2016, ISER.

[62]  Chunhua Shen,et al.  Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  Kurt Konolige,et al.  Towards lifelong visual maps , 2009, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[65]  Daniel Cremers,et al.  Direct Sparse Odometry , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[66]  Vittorio Ferrari,et al.  Fast Object Segmentation in Unconstrained Video , 2013, 2013 IEEE International Conference on Computer Vision.

[67]  Jörg Stückler,et al.  Semi-Supervised Deep Learning for Monocular Depth Map Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[69]  Richard Szeliski,et al.  Video Segmentation with Background Motion Models , 2017, BMVC.