A top-down manner-based DCNN architecture for semantic image segmentation

Given their powerful feature representation for recognition, deep convolutional neural networks (DCNNs) have been driving rapid advances in high-level computer vision tasks. However, their performance in semantic image segmentation is still not satisfactory. Based on the analysis of visual mechanism, we conclude that DCNNs in a bottom-up manner are not enough, because semantic image segmentation task requires not only recognition but also visual attention capability. In the study, superpixels containing visual attention information are introduced in a top-down manner, and an extensible architecture is proposed to improve the segmentation results of current DCNN-based methods. We employ the current state-of-the-art fully convolutional network (FCN) and FCN with conditional random field (DeepLab-CRF) as baselines to validate our architecture. Experimental results of the PASCAL VOC segmentation task qualitatively show that coarse edges and error segmentation results are well improved. We also quantitatively obtain about 2%-3% intersection over union (IOU) accuracy improvement on the PASCAL VOC 2011 and 2012 test sets.

[1]  Leslie G. Ungerleider,et al.  Mechanisms of visual attention in the human cortex. , 2000, Annual review of neuroscience.

[2]  Diane M. Beck,et al.  Top-down and bottom-up mechanisms in biasing competition in the human brain , 2009, Vision Research.

[3]  Yao Zhao,et al.  Cross-Modal Retrieval With CNN Visual Features: A New Baseline , 2017, IEEE Transactions on Cybernetics.

[4]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Cristian Sminchisescu,et al.  Semantic Segmentation with Second-Order Pooling , 2012, ECCV.

[7]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[8]  Zhuowen Tu,et al.  Convolutional Pseudo-Prior for Structured Labeling , 2015, ArXiv.

[9]  Shuicheng Yan,et al.  Human Parsing with Contextualized Convolutional Neural Network , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Ieee Xplore,et al.  IEEE Transactions on Pattern Analysis and Machine Intelligence Information for Authors , 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Ha Hong,et al.  Performance-optimized hierarchical models predict neural responses in higher visual cortex , 2014, Proceedings of the National Academy of Sciences.

[12]  Shuicheng Yan,et al.  Robust Neighborhood Preserving Projection by Nuclear/L2,1-Norm Regularization for Image Feature Extraction , 2017, IEEE Transactions on Image Processing.

[13]  Ming-Hsuan Yang,et al.  Context Driven Scene Parsing with Attention to Rare Classes , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[15]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  智一 吉田,et al.  Efficient Graph-Based Image Segmentationを用いた圃場図自動作成手法の検討 , 2014 .

[17]  Jian Sun,et al.  Guided Image Filtering , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  D. J. Felleman,et al.  Distributed hierarchical processing in the primate cerebral cortex. , 1991, Cerebral cortex.

[19]  Gregory Shakhnarovich,et al.  Feedforward semantic segmentation with zoom-out features , 2014, CVPR.

[20]  Keiji Tanaka,et al.  Matching Categorical Object Representations in Inferior Temporal Cortex of Man and Monkey , 2008, Neuron.

[21]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[22]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[23]  Ha Hong,et al.  Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream , 2013, NIPS.

[24]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[25]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[26]  Zhenfeng Zhu,et al.  Cross-Modal Retrieval With CNN Visual Features: A New Baseline. , 2017, IEEE transactions on cybernetics.

[27]  Seunghoon Hong,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Svetlana Lazebnik,et al.  Scene Parsing with Object Instance Inference Using Regions and Per-exemplar Detectors , 2015, International Journal of Computer Vision.

[29]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  M. Carrasco Visual attention: The past 25 years , 2011, Vision Research.

[31]  N. Kanwisher,et al.  The lateral occipital complex and its role in object recognition , 2001, Vision Research.

[32]  Ling Shao,et al.  Real-Time Superpixel Segmentation by DBSCAN Clustering Algorithm , 2016, IEEE Transactions on Image Processing.

[33]  T. Moore,et al.  Neural Mechanisms of Selective Visual Attention. , 2017, Annual review of psychology.

[34]  M. Tarr,et al.  Visual Object Recognition , 1996, ISTCS.

[35]  Cristian Sminchisescu,et al.  CPMC: Automatic Object Segmentation Using Constrained Parametric Min-Cuts , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[37]  E. DeYoe,et al.  Concurrent processing in the primate visual cortex. , 1995 .

[38]  Leslie G. Ungerleider,et al.  ‘What’ and ‘where’ in the human brain , 1994, Current Opinion in Neurobiology.

[39]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Haijun Zhang,et al.  Understanding Subtitles by Character-Level Sequence-to-Sequence Learning , 2017, IEEE Transactions on Industrial Informatics.

[41]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[42]  R. Desimone Visual attention mediated by biased competition in extrastriate visual cortex. , 1998, Philosophical transactions of the Royal Society of London. Series B, Biological sciences.

[43]  Daniel P. Huttenlocher,et al.  Efficient Graph-Based Image Segmentation , 2004, International Journal of Computer Vision.

[44]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  R. Malach,et al.  The topography of high-order human object areas , 2002, Trends in Cognitive Sciences.

[46]  S. Kastner,et al.  Interactions of Top-Down and Bottom-Up Mechanisms in Human Visual Cortex , 2011, The Journal of Neuroscience.

[47]  Tommy W. S. Chow,et al.  Object-Level Video Advertising: An Optimization Framework , 2017, IEEE Transactions on Industrial Informatics.

[48]  Tommy W. S. Chow,et al.  Sparse Codes Auto-Extractor for Classification: A Joint Embedding and Dictionary Learning Framework for Representation , 2016, IEEE Transactions on Signal Processing.

[49]  R. Desimone,et al.  Neural mechanisms of selective visual attention. , 1995, Annual review of neuroscience.

[50]  Jitendra Malik,et al.  Simultaneous Detection and Segmentation , 2014, ECCV.

[51]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  David L. Sheinberg,et al.  Visual object recognition. , 1996, Annual review of neuroscience.

[53]  Yan Zhang,et al.  Discriminative sparse flexible manifold embedding with novel graph for robust visual representation and label propagation , 2017, Pattern Recognit..