Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions

We investigate the role that geometric, textual and visual features play in predicting a preposition that links two visual entities depicted in an image. The task is an important step in the subsequent process of generating image descriptions. We explore preposition prediction for a pair of entities both when the entity labels are known and when they are unknown. In all settings we find clear evidence that all three feature types contribute to the prediction task.
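To make the feature combination concrete, the following is a minimal sketch (not the paper's implementation) of how geometric features derived from the two entities' bounding boxes, textual features for the entity labels, and visual features for the image regions might be concatenated and fed to a linear preposition classifier. The feature set, dimensions, label set and the use of a linear SVM are illustrative assumptions; in practice the textual vectors would come from word embeddings and the visual vectors from a CNN.

import numpy as np
from sklearn.svm import LinearSVC

PREPOSITIONS = ["on", "under", "next to", "in", "behind"]  # illustrative label set

def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def geometric_features(box_a, box_b):
    """Spatial relation between the two entity boxes (illustrative feature set)."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return np.array([
        (xb + wb / 2) - (xa + wa / 2),  # horizontal centre offset
        (yb + hb / 2) - (ya + ha / 2),  # vertical centre offset
        (wa * ha) / (wb * hb),          # area ratio
        iou(box_a, box_b),              # overlap
    ])

def pair_features(box_a, box_b, text_a, text_b, vis_a, vis_b):
    """Concatenate geometric, textual and visual features for one entity pair.
    text_* stand in for label embeddings, vis_* for region descriptors."""
    return np.concatenate([geometric_features(box_a, box_b),
                           text_a, text_b, vis_a, vis_b])

# Toy training data: random stand-ins for embeddings and CNN descriptors.
rng = np.random.default_rng(0)
X = np.stack([
    pair_features(rng.uniform(0, 1, 4), rng.uniform(0, 1, 4),
                  rng.normal(size=50), rng.normal(size=50),
                  rng.normal(size=128), rng.normal(size=128))
    for _ in range(200)
])
y = rng.integers(0, len(PREPOSITIONS), size=200)

clf = LinearSVC().fit(X, y)
print(PREPOSITIONS[clf.predict(X[:1])[0]])

Removing any one of the three feature groups from the concatenation gives a simple ablation, which is the kind of comparison that would show whether each feature type contributes to prediction accuracy.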
