Object Instance Segmentation and Fine-Grained Localization Using Hypercolumns

Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as a feature representation. However, the information in this layer may be too coarse spatially to allow precise localization. On the contrary, earlier layers may be precise in localization but will not capture semantics. To get the best of both worlds, we define the <italic>hypercolumn</italic> at a pixel as the vector of activations of all CNN units above that pixel. Using hypercolumns as pixel descriptors, we show results on three fine-grained localization tasks: simultaneous detection and segmentation, where we improve state-of-the-art from <italic>49.7</italic> mean <inline-formula><tex-math notation="LaTeX">$\mathrm{AP}^r$</tex-math><alternatives> <inline-graphic xlink:href="hariharan-ieq1-2578328.gif"/></alternatives></inline-formula> to <italic>62.4</italic>, keypoint localization, where we get a 3.3 point boost over a strong regression baseline using CNN features, and part labeling, where we show a 6.6 point gain over a strong baseline.

[1]  Yi Yang,et al.  Layered Object Models for Image Segmentation , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[4]  Pietro Perona,et al.  Reconstructive Sparse Code Transfer for Contour Detection and Semantic Labeling , 2014, ACCV.

[5]  Alan L. Yuille,et al.  Semantic part segmentation using compositional model combining shape and appearance , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  P Perona,et al.  Preattentive texture discrimination with early vision mechanisms. , 1990, Journal of the Optical Society of America. A, Optics and image science.

[7]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[8]  Tamara L. Berg,et al.  Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items , 2013, 2013 IEEE International Conference on Computer Vision.

[9]  Jitendra Malik,et al.  Using k-Poselets for Detecting People and Localizing Their Keypoints , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[12]  Gregory Shakhnarovich,et al.  Feedforward semantic segmentation with zoom-out features , 2014, CVPR.

[13]  Luis E. Ortiz,et al.  Parsing clothing in fashion photographs , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Jitendra Malik,et al.  Simultaneous Detection and Segmentation , 2014, ECCV.

[15]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[16]  J. Koenderink,et al.  Representation of local geometry in the visual system , 1987, Biological Cybernetics.

[17]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Jian Sun,et al.  Convolutional feature masking for joint object and stuff segmentation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Jitendra Malik,et al.  R-CNNs for Pose Estimation and Action Detection , 2014, ArXiv.

[20]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[21]  Derek Hoiem,et al.  Learning to localize detected objects , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Jitendra Malik,et al.  Volumetric Semantic Segmentation Using Pyramid Context Features , 2013, 2013 IEEE International Conference on Computer Vision.

[23]  Charless C. Fowlkes,et al.  Shape-based pedestrian parsing , 2011, CVPR 2011.

[24]  Yi Yang,et al.  Articulated Human Detection with Flexible Mixtures of Parts , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Edward H. Adelson,et al.  The Laplacian Pyramid as a Compact Image Code , 1983, IEEE Trans. Commun..

[26]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[27]  Thomas Brox,et al.  High Accuracy Optical Flow Estimation Based on a Theory for Warping , 2004, ECCV.

[28]  Seunghoon Hong,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Alan L. Yuille,et al.  Joint Object and Part Segmentation Using Deep Learned Potentials , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[31]  Shimon Ullman,et al.  Class-Specific, Top-Down Segmentation , 2002, ECCV.

[32]  Yann LeCun,et al.  Pedestrian Detection with Unsupervised Multi-stage Feature Learning , 2012, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[34]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[35]  Trevor Darrell,et al.  Part-Based R-CNNs for Fine-Grained Category Detection , 2014, ECCV.

[36]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[37]  Sanja Fidler,et al.  Bottom-Up Segmentation for Top-Down Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Subhransu Maji,et al.  Semantic contours from inverse detectors , 2011, 2011 International Conference on Computer Vision.

[40]  Koen E. A. van de Sande,et al.  Segmentation as selective search for object recognition , 2011, 2011 International Conference on Computer Vision.

[41]  Sanja Fidler,et al.  Detect What You Can: Detecting and Representing Objects Using Holistic Models and Body Parts , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Jitendra Malik,et al.  Determining Three-Dimensional Shape from Orientation and Spatial Frequency Disparities , 1991, ECCV.

[43]  Pietro Perona,et al.  Improved Bird Species Recognition Using Pose Normalized Deep Convolutional Nets , 2014, BMVC.

[44]  Cristian Sminchisescu,et al.  Semantic Segmentation with Second-Order Pooling , 2012, ECCV.

[45]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[46]  Jian Dong,et al.  Towards Unified Object Detection and Semantic Segmentation , 2014, ECCV.

[47]  D. Hubel,et al.  Receptive fields, binocular interaction and functional architecture in the cat's visual cortex , 1962, The Journal of physiology.

[48]  Subhransu Maji,et al.  Detecting People Using Mutually Consistent Poselet Activations , 2010, ECCV.

[49]  Jitendra Malik,et al.  Robust computation of optical flow in a multi-scale differential framework , 2005, International Journal of Computer Vision.

[50]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Xiaogang Wang,et al.  Pedestrian Parsing via Deep Decompositional Network , 2013, 2013 IEEE International Conference on Computer Vision.

[52]  Jitendra Malik,et al.  Hypercolumns for object segmentation and fine-grained localization , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Jonathan T. Barron,et al.  Multiscale Combinatorial Grouping , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Cristian Sminchisescu,et al.  Iterated Second-Order Label Sensitive Pooling for 3D Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[55]  C. V. Jawahar,et al.  The truth about cats and dogs , 2011, 2011 International Conference on Computer Vision.