Predicting Entry-Level Categories

Entry-level categories—the labels people use to name an object—were originally defined and studied by psychologists in the 1970s and 1980s. In this paper we extend these ideas to study entry-level categories at a larger scale and to learn models that can automatically predict entry-level categories for images. Our models combine visual recognition predictions with linguistic resources like WordNet and proxies for word “naturalness” mined from the enormous amount of text on the web. We demonstrate the usefulness of our models for predicting nouns (entry-level words) associated with images by people, and for learning mappings between concepts predicted by existing visual recognition systems and entry-level concepts. In this work we make use of recent successful efforts on convolutional network models for visual recognition by training classifiers for 7404 object categories on ConvNet activation features. Results for category mapping and entry-level category prediction for images show promise for producing more natural human-like labels. We also demonstrate the potential applicability of our results to the task of image description generation.

[1]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[2]  Wei Liu,et al.  Refer-to-as Relations as Semantic Knowledge , 2015, AAAI.

[3]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..

[4]  Yejin Choi,et al.  Collective Generation of Natural Image Descriptions , 2012, ACL.

[5]  Jonathan Krause,et al.  Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[7]  Fei-Fei Li,et al.  What Does Classifying More Than 10, 000 Image Categories Tell Us? , 2010, ECCV.

[8]  Lorenzo Torresani,et al.  AutoCaption: Automatic caption generation for personal photos , 2014, IEEE Winter Conference on Applications of Computer Vision.

[9]  Keiji Yanai,et al.  Mutual information of words and pictures , 2006 .

[10]  Stephen M. Kosslyn,et al.  Pictures and names: Making the connection , 1984, Cognitive Psychology.

[11]  Keiji Yanai,et al.  Probabilistic web image gathering , 2005, MIR '05.

[12]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[13]  Ali Farhadi,et al.  Learning Everything about Anything: Webly-Supervised Visual Concept Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Eugene Charniak,et al.  Nonparametric Method for Data-driven Image Captioning , 2014, ACL.

[15]  Antonio Torralba,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition , 2022 .

[16]  Yejin Choi,et al.  TreeTalk: Composition and Compression of Trees for Image Descriptions , 2014, TACL.

[17]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[18]  Yejin Choi,et al.  Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.

[19]  E. Rosch,et al.  Cognition and Categorization , 1980 .

[20]  Vicente Ordonez,et al.  Im2Text: Describing Images Using 1 Million Captioned Photographs , 2011, NIPS.

[21]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[22]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[23]  Jonathon Shlens,et al.  Fast, Accurate Detection of 100,000 Object Classes on a Single Machine , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[25]  Karl Stratos,et al.  Midge: Generating Image Descriptions From Computer Vision Detections , 2012, EACL.

[26]  Xinlei Chen,et al.  NEIL: Extracting Visual Knowledge from Web Data , 2013, 2013 IEEE International Conference on Computer Vision.

[27]  Yejin Choi,et al.  From Large Scale Image Categorization to Entry-Level Categories , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Eleanor Rosch,et al.  Principles of Categorization , 1978 .

[29]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL.

[30]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[31]  Yiannis Aloimonos,et al.  Corpus-Guided Sentence Generation of Natural Images , 2011, EMNLP.

[32]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[33]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[35]  Cordelia Schmid,et al.  Towards good practice in large-scale learning for image classification , 2012, CVPR.

[36]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[37]  Antonio Torralba,et al.  LabelMe: A Database and Web-Based Tool for Image Annotation , 2008, International Journal of Computer Vision.

[38]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Cyrus Rashtchian,et al.  Every Picture Tells a Story: Generating Sentences from Images , 2010, ECCV.

[40]  John Platt,et al.  Probabilistic Outputs for Support vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[41]  C. V. Jawahar,et al.  Choosing Linguistics over Vision to Describe Images , 2012, AAAI.