Weakly-Supervised Image Annotation and Segmentation with Objects and Attributes

We propose to model complex visual scenes using a non-parametric Bayesian model learned from weakly labelled images abundant on media sharing sites such as Flickr. Given weak image-level annotations of objects and attributes without locations or associations between them, our model aims to learn the appearance of object and attribute classes as well as their association on each object instance. Once learned, given an image, our model can be deployed to tackle a number of vision problems in a joint and coherent manner, including recognising objects in the scene (automatic object annotation), describing objects using their attributes (attribute prediction and association), and localising and delineating the objects (object detection and semantic segmentation). This is achieved by developing a novel Weakly Supervised Markov Random Field Stacked Indian Buffet Process (WS-MRF-SIBP) that models objects and attributes as latent factors and explicitly captures their correlations within and across superpixels. Extensive experiments on benchmark datasets demonstrate that our weakly supervised model significantly outperforms weakly supervised alternatives and is often comparable with existing strongly supervised models on a variety of tasks including semantic segmentation, automatic image annotation and retrieval based on object-attribute associations.

[1]  Alexander C. Berg,et al.  Automatic Attribute Discovery and Characterization from Noisy Web Data , 2010, ECCV.

[2]  Tao Xiang,et al.  Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation , 2013, 2013 IEEE International Conference on Computer Vision.

[3]  Xuming He,et al.  Superpixel Graph Label Transfer with Learned Distance Metric , 2014, ECCV.

[4]  Joachim M. Buhmann,et al.  Weakly supervised semantic segmentation with a multi-image model , 2011, 2011 International Conference on Computer Vision.

[5]  Carsten Rother,et al.  Dense Semantic Image Segmentation with Objects and Attributes , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[7]  Yee Whye Teh,et al.  Variational Inference for the Indian Buffet Process , 2009, AISTATS.

[8]  Qiang Ji,et al.  A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects , 2013, 2013 IEEE International Conference on Computer Vision.

[9]  Trevor Darrell,et al.  Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Jia Xu,et al.  Tell Me What You See and I Will Show You Where It Is , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[12]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[13]  Fei-Fei Li,et al.  Attribute Learning in Large-Scale Datasets , 2010, ECCV Workshops.

[14]  Larry S. Davis,et al.  Image ranking and retrieval based on multi-attribute queries , 2011, CVPR 2011.

[15]  Nam Nguyen,et al.  A New SVM Approach to Multi-instance Multi-label Learning , 2010, 2010 IEEE International Conference on Data Mining.

[16]  Ce Liu,et al.  Unsupervised Joint Object Discovery and Segmentation in Internet Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Hongbo Liu,et al.  Automatic Calibration Method for Driver's Head Orientation in Natural Driving Environment , 2013, IEEE Transactions on Intelligent Transportation Systems.

[18]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[19]  Huizhong Chen,et al.  Describing Clothing by Semantic Attributes , 2012, ECCV.

[20]  Cees Snoek,et al.  Attributes Make Sense on Segmented Objects , 2014, ECCV.

[21]  Yejin Choi,et al.  Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.

[22]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Zoubin Ghahramani,et al.  Correlated Non-Parametric Latent Feature Models , 2009, UAI.

[24]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Antonio Torralba,et al.  Nonparametric Scene Parsing via Label Transfer , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Thomas L. Griffiths,et al.  The Indian Buffet Process: An Introduction and Review , 2011, J. Mach. Learn. Res..

[27]  Xiaogang Wang,et al.  Random field topic model for semantic region analysis in crowded scenes from tracklets , 2011, CVPR 2011.

[28]  Ali Farhadi,et al.  Describing objects by their attributes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Ali Farhadi,et al.  Recognition using visual phrases , 2011, CVPR 2011.

[30]  Terrance E. Boult,et al.  Multi-attribute spaces: Calibration for attribute fusion and similarity search , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[32]  Yang Wang,et al.  A Discriminative Latent Model of Object Classes and Attributes , 2010, ECCV.

[33]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[34]  Song-Chun Zhu,et al.  Weakly Supervised Learning for Attribute Localization in Outdoor Scenes , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Gang Wang,et al.  Joint learning of visual attributes, object classes and visual saliency , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[36]  Trevor Darrell,et al.  Fully Convolutional Multi-Class Multiple Instance Learning , 2014, ICLR.

[37]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Svetlana Lazebnik,et al.  Superparsing , 2010, International Journal of Computer Vision.

[41]  Fei-Fei Li,et al.  Image Segmentation with Topic Random Field , 2010, ECCV.

[42]  Adriana Kovashka,et al.  Attribute Adaptation for Personalized Image Search , 2013, 2013 IEEE International Conference on Computer Vision.

[43]  Devi Parikh,et al.  Attribute Dominance: What Pops Out? , 2013, 2013 IEEE International Conference on Computer Vision.

[44]  Luc Van Gool,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[45]  Shree K. Nayar,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence Describable Visual Attributes for Face Verification and Image Search , 2022 .

[46]  Jana Kosecka,et al.  Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Ronan Collobert,et al.  From image-level to pixel-level labeling with Convolutional Networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Li Fei-Fei,et al.  Towards total scene understanding: Classification, annotation and segmentation in an automatic framework , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Yejin Choi,et al.  From Large Scale Image Categorization to Entry-Level Categories , 2013, 2013 IEEE International Conference on Computer Vision.

[50]  Jian Dong,et al.  Towards Unified Object Detection and Semantic Segmentation , 2014, ECCV.

[51]  Christoph H. Lampert,et al.  Attribute-Based Classification for Zero-Shot Visual Object Categorization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Vinod Nair,et al.  A joint learning framework for attribute models and object descriptions , 2011, 2011 International Conference on Computer Vision.

[53]  Ming-Hsuan Yang,et al.  Context Driven Scene Parsing with Attention to Rare Classes , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Tao Xiang,et al.  Learning Multimodal Latent Attributes , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  Subhransu Maji,et al.  Describing people: A poselet-based approach to attribute classification , 2011, 2011 International Conference on Computer Vision.

[56]  Zhi-Hua Zhou,et al.  Multi-instance multi-label learning , 2008, Artif. Intell..

[57]  Rong Jin,et al.  Large-Scale Image Annotation by Efficient and Robust Kernel Metric Learning , 2013, 2013 IEEE International Conference on Computer Vision.

[58]  Jia Xu,et al.  Learning to segment under various forms of weak supervision , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Ali Farhadi,et al.  Multi-attribute Queries: To Merge or Not to Merge? , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  Jing Liu,et al.  Weakly-Supervised Dual Clustering for Image Semantic Segmentation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[61]  Cyrus Rashtchian,et al.  Collecting Image Annotations Using Amazon’s Mechanical Turk , 2010, Mturk@HLT-NAACL.

[62]  Forrest N. Iandola,et al.  Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction , 2013, 2013 IEEE International Conference on Computer Vision.

[63]  Nuno Vasconcelos,et al.  Latent Dirichlet Allocation Models for Image Classification , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64]  William T. Freeman,et al.  Annotation Propagation in Large Image Databases via Dense Image Correspondence , 2012, ECCV.

[65]  Tao Xiang,et al.  Weakly Supervised Learning of Objects, Attributes and Their Associations , 2014, ECCV.

[66]  Lei Wu,et al.  Tag Completion for Image Retrieval , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[67]  Adriana Kovashka,et al.  Actively selecting annotations among objects and attributes , 2011, 2011 International Conference on Computer Vision.

[68]  Thomas Deselaers,et al.  Weakly Supervised Localization and Learning with Generic Knowledge , 2012, International Journal of Computer Vision.

[69]  Svetlana Lazebnik,et al.  Scene Parsing with Object Instance Inference Using Regions and Per-exemplar Detectors , 2015, International Journal of Computer Vision.

[70]  Joachim M. Buhmann,et al.  Weakly supervised structured output learning for semantic segmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[71]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Bill Triggs,et al.  Region Classification with Markov Field Aspect Models , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[73]  I. Biederman Recognition-by-components: a theory of human image understanding. , 1987, Psychological review.

[74]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[75]  James Hays,et al.  SUN attribute database: Discovering, annotating, and recognizing scene attributes , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[76]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[77]  Fei-Fei Li,et al.  Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.