Paper information - Every Picture Tells a Story: Generating Sentences from Images

Every Picture Tells a Story: Generating Sentences from Images

Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.
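The scoring idea in the abstract (map the image and the sentence into a shared meaning estimate, then compare the two) can be illustrated with a small sketch. The abstract does not say how meaning is represented or compared, so everything below is an assumption made for illustration: the `Meaning` triple with `obj`, `action`, and `scene` slots, the exact-match `link_score`, and the `rank_sentences` helper are hypothetical names, and the hand-written meaning estimates stand in for the learned discriminative procedures the abstract mentions.

```python
from typing import List, NamedTuple, Tuple


class Meaning(NamedTuple):
    """Hypothetical meaning estimate: one label per slot (an assumption)."""
    obj: str
    action: str
    scene: str


def link_score(image_meaning: Meaning, sentence_meaning: Meaning) -> float:
    """Toy score: fraction of slots on which the two estimates agree.

    This stands in for the learned score; it is not the paper's method.
    """
    matches = sum(a == b for a, b in zip(image_meaning, sentence_meaning))
    return matches / len(image_meaning)


def rank_sentences(
    image_meaning: Meaning,
    candidates: List[Tuple[str, Meaning]],
) -> List[Tuple[float, str]]:
    """Score every candidate sentence against the image, best first."""
    scored = [(link_score(image_meaning, m), s) for s, m in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hand-written estimates; a real system would predict these from data.
    image = Meaning("dog", "run", "park")
    candidates = [
        ("A cat sleeps indoors.", Meaning("cat", "sleep", "room")),
        ("A dog sits in the park.", Meaning("dog", "sit", "park")),
        ("A dog runs in the park.", Meaning("dog", "run", "park")),
    ]
    for score, sentence in rank_sentences(image, candidates):
        print(f"{score:.2f}  {sentence}")
```

The same score works in both directions: ranking sentences for a fixed image corresponds to attaching a description, and ranking images for a fixed sentence corresponds to finding an illustration.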

[1] F. Quimby. What's in a picture? , 1993, Laboratory animal science.

[2] Dekang Lin,et al. An Information-Theoretic Definition of Similarity , 1998, ICML.

[3] Y. Mori,et al. Image-to-word transformation based on dividing and vector quantizing images with words , 1999.

[4] Richard Sproat,et al. WordsEye: an automatic text-to-scene conversion system , 2001, SIGGRAPH.

[5] David A. Forsyth,et al. Clustering art , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[6] Mads Nielsen,et al. Computer Vision — ECCV 2002 , 2002, Lecture Notes in Computer Science.

[7] David A. Forsyth,et al. Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary , 2002, ECCV.

[8] P. Jonathon Phillips,et al. Meta-analysis of face recognition algorithms , 2001, Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition.

[9] Antonio Torralba,et al. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[10] James Ze Wang,et al. Content-based image retrieval: approaches and trends of the new age , 2005, MIR '05.

[11] Ben Taskar,et al. Learning structured prediction models: a large margin approach , 2005, ICML.

[12] Nathan D. Ratliff,et al. Subgradient Methods for Maximum Margin Structured Learning , 2006.

[13] Antonio Torralba,et al. Building the gist of a scene: the role of global image features in recognition. , 2006, Progress in brain research.

[14] Johan Bos,et al. Linguistically Motivated Large-Scale NLP with C&C and Boxer , 2007, ACL.

[15] Fei-Fei Li,et al. What, where and who? Classifying events by scene and object recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[16] Larry S. Davis,et al. Objects in Action: An Approach for Combining Action Understanding and Object Perception , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[17] Andrew J. Davison,et al. Active Matching , 2008, ECCV.

[18] Cordelia Schmid,et al. Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[19] Thomas Mensink,et al. Improving People Search Using Query Expansions , 2008, ECCV.

[21] David A. McAllester,et al. A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[22] Larry S. Davis,et al. Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers , 2008, ECCV.

[23] Derek Hoiem,et al. Pascal VOC 2008 Challenge , 2008.

[24] Barbara Caputo,et al. Who's Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation , 2009, NIPS.

[25] Larry S. Davis,et al. Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26] Larry S. Davis,et al. Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[27] Li Fei-Fei,et al. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[28] Cecilia Ovesdotter Alm,et al. Object Categorization: Words and Pictures: Categories, Modifiers, Depiction, and Iconography , 2009.

[29] Liang Lin,et al. I2T: Image Parsing to Text Description , 2010, Proceedings of the IEEE.

[30] Fei-Fei Li,et al. Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[31] Cyrus Rashtchian,et al. Collecting Image Annotations Using Amazon’s Mechanical Turk , 2010, Mturk@HLT-NAACL.