OmniDet: Surround View Cameras Based Multi-Task Visual Perception Network for Autonomous Driving

Surround View fisheye cameras are commonly deployed in automated driving for 360$^\circ$ near-field sensing around the vehicle. This work presents a multi-task visual perception network on unrectified fisheye images to enable the vehicle to sense its surrounding environment. It consists of six primary tasks necessary for an autonomous driving system: depth estimation, visual odometry, semantic segmentation, motion segmentation, object detection, and lens soiling detection. We demonstrate that the jointly trained model performs better than the respective single task versions. Our multi-task model has a shared encoder providing a significant computational advantage and has synergized decoders where tasks support each other. We propose a novel camera geometry based adaptation mechanism to encode the fisheye distortion model both at training and inference. This was crucial to enable training on the WoodScape dataset, comprised of data from different parts of the world collected by 12 different cameras mounted on three different cars with different intrinsics and viewpoints. Given that bounding boxes is not a good representation for distorted fisheye images, we also extend object detection to use a polygon with non-uniformly sampled vertices. We additionally evaluate our model on standard automotive datasets, namely KITTI and Cityscapes. We obtain the state-of-the-art results on KITTI for depth estimation and pose estimation tasks and competitive performance on the other tasks. We perform extensive ablation studies on various architecture choices and task weighting methodologies. A short video at https://youtu.be/xbSjZ5OfPes provides qualitative results.

[1]  Hang Su,et al.  Pixel-Adaptive Convolutional Neural Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Andrew J. Davison,et al.  End-To-End Multi-Task Learning With Attention , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  David H. Douglas,et al.  ALGORITHMS FOR THE REDUCTION OF THE NUMBER OF POINTS REQUIRED TO REPRESENT A DIGITIZED LINE OR ITS CARICATURE , 1973 .

[4]  Zhao Chen,et al.  GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks , 2017, ICML.

[5]  Rares Ambrus,et al.  3D Packing for Self-Supervised Monocular Depth Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Roberto Cipolla,et al.  Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Geoffrey E. Hinton,et al.  Lookahead Optimizer: k steps forward, 1 step back , 2019, NeurIPS.

[8]  Wolfram Burgard,et al.  SMSnet: Semantic motion segmentation using deep convolutional neural networks , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[9]  Patrick Mäder,et al.  UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models , 2020, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[10]  Senthil Yogamani,et al.  MultiNet++: Multi-Stream Feature Aggregation and Geometric Loss Strategy for Multi-Task Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[11]  Chang Shu,et al.  Feature-metric Loss for Self-supervised Learning of Depth and Egomotion , 2020, ECCV.

[12]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Anelia Angelova,et al.  Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos , 2018, AAAI.

[14]  Li Fei-Fei,et al.  Dynamic Task Prioritization for Multitask Learning , 2018, ECCV.

[15]  Patrick Mäder,et al.  FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving , 2020, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[16]  Stefan Milz,et al.  SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[17]  Stefan Milz,et al.  WoodScape: A Multi-Task, Multi-Camera Fisheye Dataset for Autonomous Driving , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Senthil Yogamani,et al.  FisheyeMultiNet: Real-time Multi-task Learning Architecture for Surround-view Automated Parking System , 2019, ArXiv.

[19]  Liyuan Liu,et al.  On the Variance of the Adaptive Learning Rate and Beyond , 2019, ICLR.

[20]  Luc Van Gool,et al.  Multi-Task Learning for Dense Prediction Tasks: A Survey. , 2020, IEEE transactions on pattern analysis and machine intelligence.

[21]  Martin Jägersand,et al.  MODNet: Motion and Appearance based Moving Object Detection Network for Autonomous Driving , 2018, 2018 21st International Conference on Intelligent Transportation Systems (ITSC).

[22]  Thomas Brox,et al.  CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Vladlen Koltun,et al.  Exploring Self-Attention for Image Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Baishakhi Ray,et al.  Multitask Learning Strengthens Adversarial Robustness , 2020, ECCV.

[25]  Senthil Yogamani,et al.  Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline , 2020, ArXiv.

[26]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[27]  Senthil Yogamani,et al.  NeurAll: Towards a Unified Visual Perception Model for Automated Driving , 2019, 2019 IEEE Intelligent Transportation Systems Conference (ITSC).

[28]  Michael J. Black,et al.  Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Zhichao Yin,et al.  GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Cordelia Schmid,et al.  Self-Supervised Learning With Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34]  Roland T. Chin,et al.  On the Detection of Dominant Points on Digital Curves , 1989, IEEE Trans. Pattern Anal. Mach. Intell..

[35]  Jonathan T. Barron,et al.  A General and Adaptive Robust Loss Function , 2017, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[37]  Ali Farhadi,et al.  YOLOv3: An Incremental Improvement , 2018, ArXiv.

[38]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Thomas Brox,et al.  Sparsity Invariant CNNs , 2017, 2017 International Conference on 3D Vision (3DV).

[40]  Senthil Yogamani,et al.  SoilingNet: Soiling Detection on Automotive Surround-View Cameras , 2019, 2019 IEEE Intelligent Transportation Systems Conference (ITSC).

[41]  Roberto Cipolla,et al.  MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving , 2016, 2018 IEEE Intelligent Vehicles Symposium (IV).

[42]  Matthew B. Blaschko,et al.  The Lovasz-Softmax Loss: A Tractable Surrogate for the Optimization of the Intersection-Over-Union Measure in Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[44]  Marek Vajgl,et al.  Poly-YOLO: higher speed, more precise detection and instance segmentation for YOLOv3 , 2020, Neural Comput. Appl..

[45]  Jason Yosinski,et al.  An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution , 2018, NeurIPS.

[46]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[47]  Iasonas Kokkinos,et al.  UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).