ReferItGame: Referring to Objects in Photographs of Natural Scenes

In this paper we introduce a new game for crowd-sourcing natural-language referring expressions. By designing a two-player game, we can both collect and verify referring expressions directly within the game. To date, the game has produced a dataset containing 130,525 expressions, referring to 96,654 distinct objects, in 19,894 photographs of natural scenes. This dataset is larger and more varied than previous REG datasets and allows us to study referring expressions in real-world scenes. We provide an in-depth analysis of the resulting dataset. Based on our findings, we design a new optimization-based model for generating referring expressions and perform experimental evaluations on three test sets.
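The collect-and-verify loop described above can be pictured with a minimal sketch: one player writes an expression for a highlighted object, and the other player's click is checked against that object's region. This is an illustrative assumption, not the authors' implementation; the record fields, the `click_verifies` helper, and the (x, y, width, height) box convention are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ReferringExpression:
    """Hypothetical record for one collected expression (field names are illustrative)."""
    image_id: str           # photograph the expression refers to
    region: tuple           # object region as (x, y, width, height); an assumed convention
    expression: str         # free-form text written by the describing player
    verified: bool = False  # set True once the guessing player locates the object

def click_verifies(record: ReferringExpression, click_xy: tuple) -> bool:
    """Treat an expression as verified when the guessing player's click
    lands inside the referred object's region (one plausible check)."""
    x, y, w, h = record.region
    cx, cy = click_xy
    return x <= cx <= x + w and y <= cy <= y + h

# Example: the second player clicks at (120, 85) after reading the expression.
rec = ReferringExpression("img_000042", (100, 60, 80, 50), "the man in the red shirt")
rec.verified = click_verifies(rec, (120, 85))
print(rec.verified)  # True
```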
