Depth Augmented Semantic Segmentation Networks for Automated Driving

In this paper, we explore augmenting semantic segmentation networks with depth maps to improve performance, motivated by the geometric structure of automotive scenes. Depth is typically already computed in an automotive system for object localization and path planning, and can therefore be leveraged for semantic segmentation. We construct two baseline networks, “RGB only” and “Depth only”, and investigate the impact of fusing both cues using two further networks, “RGBD concat” and “Two Stream RGB+D”. We evaluate these networks on two automotive datasets: Virtual KITTI, using synthetic depth, and Cityscapes, using a standard stereo depth estimation algorithm. Additionally, we evaluate our approach using the monoDepth unsupervised estimator [10]. The two-stream architecture achieves the best results, with an improvement of 5.7% IoU on Virtual KITTI and 1% IoU on Cityscapes. There is a large improvement for certain classes such as truck, building, van and car, which increase by 29%, 11%, 9% and 8% respectively on Virtual KITTI. Surprisingly, a CNN model is able to produce good semantic segmentation from depth images alone. The proposed network runs at 4 fps on a TitanX GPU (Maxwell architecture).
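The results above are reported as intersection-over-union (IoU) scores. As a minimal sketch of how such a score can be computed, the Python below takes flat lists of predicted and ground-truth class labels and returns the per-class IoU and its mean. The function names (`per_class_iou`, `mean_iou`) and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the abstract does not state the exact averaging protocol used in the paper.

```python
from collections import Counter


def per_class_iou(pred, target, num_classes):
    """Return one IoU value per class; None for a class absent from both inputs.

    pred and target are flat sequences of integer class labels (one per pixel).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")

    intersection = Counter()  # pixels where prediction and ground truth agree
    pred_count = Counter()    # pixels predicted as each class
    target_count = Counter()  # pixels labelled as each class

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        # |A ∪ B| = |A| + |B| - |A ∩ B|
        union = pred_count[c] + target_count[c] - intersection[c]
        ious.append(intersection[c] / union if union else None)
    return ious


def mean_iou(pred, target, num_classes):
    """Average the IoU over classes that occur in pred or target (assumed protocol)."""
    valid = [v for v in per_class_iou(pred, target, num_classes) if v is not None]
    if not valid:
        raise ValueError("no class is present in either input")
    return sum(valid) / len(valid)


if __name__ == "__main__":
    # Hypothetical six-pixel example with three classes.
    target = [0, 0, 1, 1, 2, 2]
    pred = [0, 1, 1, 1, 2, 0]
    print(per_class_iou(pred, target, 3))
    print(mean_iou(pred, target, 3))
```

With this metric, a per-class gain such as the one reported for trucks corresponds to a larger overlap between the predicted and labelled truck pixels relative to their union.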

[1]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions, 2015, ICLR.

[2]  Ulrich Neumann,et al.  Depth-aware CNN for RGB-D Segmentation , 2018, ECCV.

[3]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[4]  John McDonald,et al.  Vision-Based Driver Assistance Systems: Survey, Taxonomy and Advances , 2015, 2015 IEEE 18th International Conference on Intelligent Transportation Systems.

[5]  Kristen Grauman,et al.  FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Roberto Cipolla,et al.  Segmentation and Recognition Using Structure from Motion Point Clouds , 2008, ECCV.

[7]  Stefan Leutenegger,et al.  ElasticFusion: Dense SLAM Without A Pose Graph , 2015, Robotics: Science and Systems.

[8]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  James M. Rehg,et al.  Joint Semantic Segmentation and 3D Reconstruction from Monocular Video , 2014, ECCV.

[10]  Martin Jägersand,et al.  Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges , 2017, 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC).

[11]  Martin Jägersand,et al.  MODNet: Moving Object Detection Network with Motion and Appearance for Autonomous Driving , 2017, ArXiv.

[12]  Thomas Brox,et al.  Sparsity Invariant CNNs , 2017, 2017 International Conference on 3D Vision (3DV).

[13]  Marc Pollefeys,et al.  The Stixel World: A medium-level representation of traffic scenes , 2017, Image Vis. Comput.

[14]  Stefan Leutenegger,et al.  SemanticFusion: Dense 3D semantic mapping with convolutional neural networks , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[15]  Yi Yang,et al.  Attention to Scale: Scale-Aware Semantic Image Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Thomas Brox,et al.  A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[20]  H. Hirschmüller.  Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information , 2005, CVPR.

[21]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Daniel Cohen-Or,et al.  Cascaded Feature Network for Semantic Segmentation of RGB-D Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Jörg Stückler,et al.  Multi-view deep learning for consistent semantic mapping with RGB-D cameras , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[24]  Guo-Jun Qi,et al.  Hierarchically Gated Deep Networks for Semantic Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Seunghoon Hong,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[26]  Heng Tao Shen,et al.  Exploiting Depth From Single Monocular Images for Object Detection and Semantic Segmentation , 2016, IEEE Transactions on Image Processing.

[27]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Qiao Wang,et al.  Virtual Worlds as Proxy for Multi-object Tracking Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Daniel Cremers,et al.  FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture , 2016, ACCV.