Baby talk: Understanding and generating simple image descriptions

We posit that visually descriptive language offers computer vision researchers both information about the world and information about how people describe the world. The potential benefit of this source is amplified by the enormous amount of language data readily available today. We present a system that automatically generates natural language descriptions of images by exploiting both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision. The system is very effective at producing relevant sentences for images, and it generates descriptions that are notably more faithful to the specific image content than previous work.
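To make the high-level pipeline concrete, the sketch below illustrates only the final surface-realization step under assumed inputs: hypothetical vision outputs consisting of (attribute, noun) pairs for detected objects and a preposition predicted for an object pair. The function name, inputs, and example detections are illustrative assumptions, not the authors' code or templates.

```python
# Minimal illustrative sketch (not the authors' implementation) of turning
# assumed vision outputs into a simple templated description.

def describe(objects, prepositions):
    """objects: list of (attribute, noun) pairs, e.g. [("brown", "dog"), ("grey", "sofa")]
    prepositions: dict mapping (i, j) object-index pairs to a preposition string."""
    noun_phrases = [f"the {attr} {noun}" for attr, noun in objects]
    # Opening sentence enumerating the detected objects with their attributes.
    opening = "This is a picture of " + " and ".join(
        f"one {attr} {noun}" for attr, noun in objects) + "."
    sentences = [opening]
    # One sentence per predicted spatial relation between object pairs.
    for (i, j), prep in prepositions.items():
        sentences.append(f"{noun_phrases[i].capitalize()} is {prep} {noun_phrases[j]}.")
    return " ".join(sentences)

print(describe([("brown", "dog"), ("grey", "sofa")], {(0, 1): "near"}))
# -> "This is a picture of one brown dog and one grey sofa. The brown dog is near the grey sofa."
```

In the full system as described in the abstract, the choice of attributes, objects, and relations would come from trained recognizers, and large-scale text statistics would inform the wording; this sketch only shows how such structured predictions can be rendered as a sentence.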
