Combining Text Semantics and Image Geometry to Improve Scene Interpretation

In this paper, we describe a novel system that identifies relations between the objects extracted from an image. We started from the idea that, in addition to the geometric and visual properties of the image objects, we could exploit lexical and semantic information from the text accompanying the image. As an experimental setup, we gathered a corpus of images from Wikipedia together with their associated articles. We extracted two types of objects, human beings and horses, and we considered three relations that could hold between them: \textit{Ride}, \textit{Lead}, or \textit{None}. We used geometric features as a baseline to identify the relations between the entities, and we describe the improvements brought by adding bag-of-words features and predicate--argument structures derived from the text. The best semantic model resulted in a relative error reduction of more than 18\% over the baseline.
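
For readers unfamiliar with the metric, relative error reduction is conventionally defined as below. This is a sketch of the standard definition rather than a formula quoted from the paper, and the symbols $e_{\text{base}}$ and $e_{\text{sem}}$ (the error rates of the geometric baseline and of the best semantic model) are notation assumed here for illustration.

\[
\mathrm{RER} = \frac{e_{\text{base}} - e_{\text{sem}}}{e_{\text{base}}}
\]

Under this definition, a relative error reduction of more than 18\% means $e_{\text{sem}} < 0.82\, e_{\text{base}}$. It does not by itself determine the absolute error rates of either model.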
