Multimodal End-to-End Autonomous Driving

Autonomous vehicles (AVs) are key for the intelligent mobility of the future. A crucial component of an AV is the artificial intelligence (AI) able to drive towards a desired destination. Today, there are different paradigms addressing the development of AI drivers. On the one hand, we find modular pipelines, which divide the driving task into sub-tasks such as perception (e.g., object detection, semantic segmentation, depth estimation, tracking) and maneuver control (e.g., local path planning and control). On the other hand, we find end-to-end driving approaches that try to learn a direct mapping from raw sensor data to vehicle control signals (e.g., the steering angle). The latter are relatively less studied, but are gaining popularity since they are less demanding in terms of sensor data annotation. This paper focuses on end-to-end autonomous driving. So far, most proposals relying on this paradigm assume RGB images as input sensor data. However, AVs will not be equipped only with cameras, but also with active sensors providing accurate depth information (e.g., traditional LiDARs, or new solid-state ones). Accordingly, this paper analyses whether RGB and depth data, i.e., RGBD data, can actually act as complementary information in a multimodal end-to-end driving approach, producing a better AI driver. Using the CARLA simulator functionalities, its standard benchmark, and conditional imitation learning (CIL), we show that RGBD indeed gives rise to more successful end-to-end AI drivers. We compare the use of RGBD information by means of early, mid and late fusion schemes, both in multisensory and single-sensor (monocular depth estimation) settings.
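
To make the distinction between early, mid and late fusion concrete, the following is a minimal sketch in plain Python and not the architecture used in the paper. The helpers `encode` and `head` are hypothetical stand-ins for a convolutional encoder and a control head, and the weighted average used for late fusion is an assumption chosen for illustration. The sketch only shows where in the pipeline the RGB and depth streams are joined.

```python
def encode(pixels):
    """Toy stand-in for a CNN encoder (assumption): per-channel mean over pixels."""
    n = len(pixels)
    channels = len(pixels[0])
    return [sum(p[c] for p in pixels) / n for c in range(channels)]


def head(features, weights):
    """Toy stand-in for a control head (assumption): one linear unit, e.g. steering."""
    return sum(f * w for f, w in zip(features, weights))


def early_fusion(rgb, depth, weights):
    # Join at the input: concatenate channels per pixel (3 + 1 = 4),
    # then run a single encoder and a single head.
    rgbd = [p + d for p, d in zip(rgb, depth)]
    return head(encode(rgbd), weights)


def mid_fusion(rgb, depth, weights):
    # Join at the feature level: one encoder per modality,
    # feature vectors concatenated before a shared head.
    features = encode(rgb) + encode(depth)
    return head(features, weights)


def late_fusion(rgb, depth, w_rgb, w_depth, mix=0.5):
    # Join at the output: one full branch per modality,
    # the two predicted control signals are then combined.
    steer_rgb = head(encode(rgb), w_rgb)
    steer_depth = head(encode(depth), w_depth)
    return mix * steer_rgb + (1 - mix) * steer_depth


# Two-pixel toy "image": RGB as 3-tuples, depth as 1-tuples.
rgb = [(0.0, 0.5, 1.0), (1.0, 0.5, 0.0)]
depth = [(0.2,), (0.4,)]

steer_early = early_fusion(rgb, depth, [1.0, 0.0, -1.0, 2.0])
steer_mid = mid_fusion(rgb, depth, [1.0, 0.0, -1.0, 2.0])
steer_late = late_fusion(rgb, depth, [1.0, 0.0, -1.0], [2.0])
```

Because the toy encoder is a per-channel mean, early and mid fusion produce the same number here. With learned, nonlinear encoders the two schemes generally differ, since early fusion lets the first layers mix colour and depth while mid fusion keeps the modalities separate until the feature level.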
