On Available Corpora for Empirical Methods in Vision & Language

Integrating vision and language has long been a dream in work on artificial intelligence (AI). In the past two years, we have witnessed an explosion of work that brings together vision and language, from images to videos and beyond. The available corpora have played a crucial role in advancing this area of research. In this paper, we propose a set of quality metrics for evaluating and analyzing vision & language datasets and classify the datasets accordingly. Our analyses show that the most recent datasets use more complex language and more abstract concepts; however, each dataset has its own strengths and weaknesses.
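
As a hedged illustration of the kind of corpus-level language statistics such quality metrics might build on (the function and metric choices below are assumptions for exposition, not the paper's actual definitions), here is a minimal Python sketch that computes vocabulary size, mean caption length, and type-token ratio for a set of captions as rough proxies for linguistic complexity.

```python
# Minimal sketch (assumed metrics, not the paper's own) of simple
# language-complexity proxies over a list of caption strings.
from collections import Counter


def language_stats(captions):
    """Return vocabulary size, mean caption length, and type-token ratio."""
    tokens = [tok.lower() for cap in captions for tok in cap.split()]
    counts = Counter(tokens)
    vocab_size = len(counts)
    mean_caption_length = len(tokens) / len(captions) if captions else 0.0
    type_token_ratio = vocab_size / len(tokens) if tokens else 0.0
    return {
        "vocab_size": vocab_size,
        "mean_caption_length": mean_caption_length,
        "type_token_ratio": type_token_ratio,
    }


if __name__ == "__main__":
    # Toy example with two hypothetical captions.
    demo = [
        "A man rides a horse on the beach.",
        "Two dogs are playing with a red ball in the park.",
    ]
    print(language_stats(demo))
```

In practice one would compute such statistics per dataset and compare them across corpora; richer measures (e.g., syntactic depth or concept abstractness) require a parser or lexical resources and are beyond this sketch.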
