OCNet: Object Context Network for Scene Parsing

In this paper, we address the problem of scene parsing with deep learning and focus on the context aggregation strategy for robust segmentation. Motivated by that the label of a pixel is the category of the object that the pixel belongs to, we introduce an \emph{object context pooling (OCP)} scheme, which represents each pixel by exploiting the set of pixels that belong to the same object category with such a pixel, and we call the set of pixels as object context. Our implementation, inspired by the self-attention approach, consists of two steps: (i) compute the similarities between each pixel and all the pixels, forming a so-called object context map for each pixel served as a surrogate for the true object context, and (ii) represent the pixel by aggregating the features of all the pixels weighted by the similarities. The resulting representation is more robust compared to existing context aggregation schemes, e.g., pyramid pooling modules (PPM) in PSPNet and atrous spatial pyramid pooling (ASPP), which do not differentiate the context pixels belonging to the same object category or not, making the reliability of contextually aggregated representations limited. We empirically demonstrate our approach and two pyramid extensions with state-of-the-art performance on three semantic segmentation benchmarks: Cityscapes, ADE20K and LIP. Code has been made available at: this https URL.

[1]  Shu Kong,et al.  Recurrent Scene Parsing with Perspective Understanding in the Loop , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Yi Zhang,et al.  PSANet: Point-wise Spatial Attention Network for Scene Parsing , 2018, ECCV.

[3]  Shuicheng Yan,et al.  Mutual Learning to Adapt for Joint Human Parsing and Pose Estimation , 2018, ECCV.

[4]  Ke Gong,et al.  Look into Person: Self-Supervised Structure-Sensitive Learning and a New Benchmark for Human Parsing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Alexei A. Efros,et al.  An empirical study of context in object detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Wei Liu,et al.  ParseNet: Looking Wider to See Better , 2015, ArXiv.

[7]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Garrison W. Cottrell,et al.  Understanding Convolution for Semantic Segmentation , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[9]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Shiguang Shan,et al.  Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Sheng Tang,et al.  Scale-Adaptive Convolutions for Scene Parsing , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Dustin Tran,et al.  Image Transformer , 2018, ICML.

[13]  Stella X. Yu,et al.  Adaptive Affinity Fields for Semantic Segmentation , 2018, ECCV.

[14]  Fang Zhao,et al.  Self-Supervised Neural Aggregation Networks for Human Parsing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[15]  Jun Fu,et al.  Dual Attention Network for Scene Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[17]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Kun Yu,et al.  DenseASPP for Semantic Segmentation in Street Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Bowen Zhou,et al.  A Structured Self-attentive Sentence Embedding , 2017, ICLR.

[21]  Yi Yang,et al.  Macro-Micro Adversarial Network for Human Parsing , 2018, ECCV.

[22]  Davide Modolo,et al.  Objects as Context for Detecting Their Semantic Parts , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Bolei Zhou,et al.  Scene Parsing through ADE20K Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.

[25]  Anton van den Hengel,et al.  High-performance Semantic Segmentation Using Very Deep Fully Convolutional Networks , 2016, ArXiv.

[26]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[27]  Yunchao Wei,et al.  Devil in the Details: Towards Accurate Single and Multiple Human Parsing , 2018, AAAI.

[28]  Lorenzo Porzi,et al.  In-place Activated BatchNorm for Memory-Optimized Training of DNNs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Ian D. Reid,et al.  RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Liang Lin,et al.  Look into Person: Joint Body Parsing & Pose Estimation Network and a New Benchmark , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Gang Yu,et al.  BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation , 2018, ECCV.

[34]  Eric P. Xing,et al.  Dynamic-Structured Semantic Propagation Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Yuning Jiang,et al.  Unified Perceptual Parsing for Scene Understanding , 2018, ECCV.

[36]  Xiaogang Wang,et al.  Context Encoding for Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[38]  Gang Yu,et al.  Learning a Discriminative Feature Network for Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.