Error Correction for Dense Semantic Image Labeling

Pixel-wise semantic image labeling is an important, yet challenging task with many applications. Especially in autonomous driving systems, it allows for a full understanding of the system's surroundings, which is crucial for trajectory planning. Typical approaches to tackle this problem involve either the training of deep networks on vast amounts of images to directly infer the labels or the use of probabilistic graphical models to jointly model the dependencies of the input (i.e. images) and output (i.e. labels). Yet, the former approaches do not capture the structure of the output labels, which is crucial for the performance of dense labeling, and the latter rely on carefully hand-designed priors that require costly parameter tuning via optimization techniques, which in turn leads to long inference times. To alleviate these restrictions, we explore how to arrive at dense semantic pixel labels given both the input image and an initial estimate of the output labels. We propose a parallel architecture that: 1) exploits the context information through a LabelPropagation network to propagate correct labels from nearby pixels to improve the object boundaries, 2) uses a LabelReplacement network to directly replace possibly erroneous, initial labels with new ones, and 3) combines the different intermediate results via a Fusion network to obtain the final per-pixel label. We experimentally validate our approach on two different datasets for semantic segmentation, where we show improvements over the state-of-the-art. We also provide both a quantitative and qualitative analysis of the generated results.

[1]  Ronan Collobert,et al.  Learning to Refine Object Segments , 2016, ECCV.

[2]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[3]  Xiaoou Tang,et al.  Video Frame Synthesis Using Deep Voxel Flow , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4]  Yi Yang,et al.  Attention to Scale: Scale-Aware Semantic Image Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Xiaogang Wang,et al.  Deep Convolutional Network Cascade for Facial Point Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Andrew Blake,et al.  Image Segmentation by Branch-and-Mincut , 2008, ECCV.

[7]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[9]  Yu-Bin Yang,et al.  Image Restoration Using Very Deep Convolutional Encoder-Decoder Networks with Symmetric Skip Connections , 2016, NIPS.

[10]  Stanley T. Birchfield,et al.  Image-based segmentation of indoor corridor floors for a mobile robot , 2010, 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[11]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[12]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[13]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Iasonas Kokkinos,et al.  Segmentation-Aware Convolutional Networks Using Local Attention Masks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Jitendra Malik,et al.  View Synthesis by Appearance Flow , 2016, ECCV.

[16]  Jitendra Malik,et al.  Iterative Instance Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[18]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[19]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[21]  Cristian Sminchisescu,et al.  Parametric Image Segmentation of Humans with Structural Shape Priors , 2016, ACCV.

[22]  Nikos Komodakis,et al.  Detect, Replace, Refine: Deep Structured Prediction for Pixel Wise Labeling , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Ming-Hsuan Yang,et al.  Multi-objective convolutional learning for face labeling , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Subhransu Maji,et al.  Semantic contours from inverse detectors , 2011, 2011 International Conference on Computer Vision.

[25]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Zhe L. Lin,et al.  Exemplar-Based Face Parsing , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Vincent Lepetit,et al.  Hands Deep in Deep Learning for Hand Pose Estimation , 2015, ArXiv.

[28]  Wei Liu,et al.  ParseNet: Looking Wider to See Better , 2015, ArXiv.

[29]  Kristen Grauman,et al.  Shape Sharing for Object Segmentation , 2012, ECCV.

[30]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Jonathan T. Barron,et al.  Multiscale Combinatorial Grouping , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Simon Haykin,et al.  GradientBased Learning Applied to Document Recognition , 2001 .

[33]  Davis E. King,et al.  Dlib-ml: A Machine Learning Toolkit , 2009, J. Mach. Learn. Res..

[34]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[36]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[37]  Sinisa Todorovic,et al.  A Multi-scale CNN for Affordance Segmentation in RGB Images , 2016, ECCV.

[38]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[39]  Jitendra Malik,et al.  Simultaneous Detection and Segmentation , 2014, ECCV.

[40]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[42]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Eugenio Culurciello,et al.  ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation , 2016, ArXiv.

[44]  Heesung Jun,et al.  Vision-based location positioning using augmented reality for indoor navigation , 2008, IEEE Transactions on Consumer Electronics.

[45]  Ji Wan,et al.  Deep Learning for Content-Based Image Retrieval: A Comprehensive Study , 2014, ACM Multimedia.

[46]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[47]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.