Natural Vocabulary Emerges from Free-Form Annotations

We propose an approach for annotating object classes using free-form text written by undirected and untrained annotators. Free-form labeling is natural for annotators: they intuitively provide very specific and exhaustive labels, and no training stage is necessary. We first collect 729 labels on 15k images using 124 different annotators. Then we automatically enrich the structure of these free-form annotations by discovering a natural vocabulary of 4020 classes within them. This vocabulary represents the natural distribution of objects well and is learned directly from data, rather than being an educated guess made before collecting any labels. Hence, the natural vocabulary emerges from a large mass of free-form annotations. To do so, we (i) map the raw input strings to entities in an ontology of physical objects, which gives them an unambiguous meaning; and (ii) leverage inter-annotator co-occurrences, as well as biases and knowledge specific to individual annotators. Finally, we also automatically extract natural vocabularies of reduced size that have high object coverage while remaining specific. These reduced vocabularies represent the natural distribution of objects much better than commonly used predefined vocabularies. Moreover, they feature a more uniform sample distribution over classes.
