Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes

Semantic image segmentation is an essential component of modern autonomous driving systems, as an accurate understanding of the surrounding scene is crucial to navigation and action planning. Current state-of-the-art approaches in semantic image segmentation rely on pre-trained networks that were initially developed for classifying images as a whole. While these networks exhibit outstanding recognition performance (i.e., what is visible?), they lack localization accuracy (i.e., where precisely is something located?). Therefore, additional processing steps have to be performed in order to obtain pixel-accurate segmentation masks at the full image resolution. To alleviate this problem we propose a novel ResNet-like architecture that exhibits strong localization and recognition performance. We combine multi-scale context with pixel-level accuracy by using two processing streams within our network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals. Without additional processing steps and without pre-training, our approach achieves an intersection-over-union score of 71.8% on the Cityscapes dataset.

[1]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[2]  Stephen Gould,et al.  Multi-Class Segmentation with Relative Location Prior , 2008, International Journal of Computer Vision.

[3]  Pushmeet Kohli,et al.  Robust Higher Order Potentials for Enforcing Label Consistency , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Roberto Cipolla,et al.  Semantic texton forests for image categorization and segmentation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Joseph J. Lim,et al.  Recognition using regions , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Jianxiong Xiao,et al.  Multiple view semantic segmentation for street view images , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[7]  Mayank Bansal,et al.  Pedestrian detection with depth-guided structure labeling , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[8]  Luc Van Gool,et al.  Segmentation-Based Urban Traffic Scene Understanding , 2009, BMVC.

[9]  Stephen Gould,et al.  Single image depth estimation from predicted semantic labels , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10]  Manuel Menezes de Oliveira Neto,et al.  Domain transform for edge-aware image and video processing , 2011, ACM Trans. Graph..

[11]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[12]  Graham W. Taylor,et al.  Adaptive deconvolutional networks for mid and high level feature learning , 2011, 2011 International Conference on Computer Vision.

[13]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[14]  Bastian Leibe,et al.  Joint 2D-3D temporally consistent semantic segmentation of street scenes , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  J. Mixter Fast , 2012 .

[16]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  David D. Cox,et al.  Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures , 2013, ICML.

[18]  Jitendra Malik,et al.  Simultaneous Detection and Segmentation , 2014, ECCV.

[19]  Marc Pollefeys,et al.  Pulling Things out of Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  James M. Rehg,et al.  Joint Semantic Segmentation and 3D Reconstruction from Monocular Video , 2014, ECCV.

[21]  Wei Liu,et al.  ParseNet: Looking Wider to See Better , 2015, ArXiv.

[22]  Gregory Shakhnarovich,et al.  Feedforward semantic segmentation with zoom-out features , 2014, CVPR.

[23]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[24]  Jian Sun,et al.  Convolutional feature masking for joint object and stuff segmentation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Raquel Urtasun,et al.  Fully Connected Deep Structured Networks , 2015, ArXiv.

[26]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[27]  Xiaoxiao Li,et al.  Semantic Image Segmentation via Deep Parsing Network , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Colin Raffel,et al.  Lasagne: First release. , 2015 .

[29]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling , 2015, CVPR 2015.

[30]  Sanja Fidler,et al.  segDeepM: Exploiting segmentation and context in deep neural networks for object detection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[32]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33]  Seunghoon Hong,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[34]  Xiangyu Zhu,et al.  Object detection by labeling superpixels , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  George Papandreou,et al.  Weakly-and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[36]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[38]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[39]  Jian Sun,et al.  BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[40]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Guosheng Lin,et al.  Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Bastian Leibe,et al.  Multi-scale object candidates for generic object tracking in street scenes , 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[43]  John Salvatier,et al.  Theano: A Python framework for fast computation of mathematical expressions , 2016, ArXiv.

[44]  Charless C. Fowlkes,et al.  Laplacian Reconstruction and Refinement for Semantic Segmentation , 2016, ArXiv.

[45]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[46]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[47]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[48]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Tianqi Chen,et al.  Training Deep Nets with Sublinear Memory Cost , 2016, ArXiv.

[50]  Anton van den Hengel,et al.  Bridging Category-level and Instance-level Semantic Image Segmentation , 2016, ArXiv.

[51]  Hao Chen,et al.  VoxResNet: Deep Voxelwise Residual Networks for Volumetric Brain Segmentation , 2016, ArXiv.

[52]  Iasonas Kokkinos,et al.  Fast, Exact and Multi-scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs , 2016, ECCV.

[53]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Eugenio Culurciello,et al.  ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation , 2016, ArXiv.

[55]  Qiao Wang,et al.  VirtualWorlds as Proxy for Multi-object Tracking Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Alex Graves,et al.  Memory-Efficient Backpropagation Through Time , 2016, NIPS.

[57]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[59]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.