Semantic Amodal Segmentation

Common visual recognition tasks such as classification, object detection, and semantic segmentation are rapidly reaching maturity, and given the recent rate of progress, it is not unreasonable to conjecture that techniques for many of these problems will approach human levels of performance in the next few years. In this paper we look to the future: what is the next frontier in visual recognition? We offer one possible answer to this question. We propose a detailed image annotation that captures information beyond the visible pixels and requires complex reasoning about full scene structure. Specifically, we create an amodal segmentation of each image: the full extent of each region is marked, not just the visible pixels. Annotators outline and name all salient regions in the image and specify a partial depth order. The result is a rich scene structure, including visible and occluded portions of each region, figure-ground edge information, semantic labels, and object overlap. We create two datasets for semantic amodal segmentation. First, we label 500 images in the BSDS dataset with multiple annotators per image, allowing us to study the statistics of human annotations. We show that the proposed full scene annotation is surprisingly consistent between annotators, including for regions and edges. Second, we annotate 5000 images from COCO. This larger dataset allows us to explore a number of algorithmic ideas for amodal segmentation and depth ordering. We introduce novel metrics for these tasks, and along with our strong baselines, define concrete new challenges for the community.
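To make the annotation concrete, the following minimal sketch shows how visible regions and per-region occlusion could be derived from amodal masks plus a depth order. It is an illustration under stated assumptions, not the paper's implementation: masks are plain sets of (row, column) pixels, the depth order is simplified to a total front-to-back list (the annotation itself only specifies a partial order), and the names `visible_masks`, `occlusion_rate`, and `rect` are hypothetical.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def visible_masks(amodal: List[Set[Pixel]]) -> List[Set[Pixel]]:
    """Derive visible masks from amodal masks listed front to back.

    Assumption: a total depth order, which simplifies the partial
    order described in the abstract.
    """
    covered: Set[Pixel] = set()
    visible: List[Set[Pixel]] = []
    for mask in amodal:
        # A pixel is visible only if no nearer region already covers it.
        visible.append(mask - covered)
        covered |= mask
    return visible


def occlusion_rate(amodal_mask: Set[Pixel], visible_mask: Set[Pixel]) -> float:
    """Fraction of a region's full (amodal) extent that is hidden."""
    if not amodal_mask:
        return 0.0
    return 1.0 - len(visible_mask) / len(amodal_mask)


def rect(r0: int, c0: int, r1: int, c1: int) -> Set[Pixel]:
    """Hypothetical helper: a filled rectangle over half-open ranges."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Two overlapping regions; the first is in front of the second.
front = rect(0, 0, 2, 2)  # 4 pixels
back = rect(0, 1, 2, 4)   # 6 pixels, 2 of them behind `front`

vis = visible_masks([front, back])
rates = [occlusion_rate(a, v) for a, v in zip([front, back], vis)]
```

Here the front region is fully visible, while the back region keeps its full six-pixel amodal extent but only four of those pixels are visible, so one third of it is occluded.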
