Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries

The widespread integration of cameras into hand-held and head-worn devices, together with the ability to share content online, enables a large and diverse visual record of the world that millions of users collectively build up every day. We envision these images, along with associated metadata such as GPS coordinates and timestamps, forming a collective visual memory that can be queried while automatically taking the ever-changing context of mobile users into account. As a first step towards this vision, we present Xplore-M-Ego: a novel media retrieval system that allows users to query a dynamic database of images using spatio-temporal natural language queries. We evaluate our system on a new dataset of real image queries as well as through a usability study. One key finding is that there is considerable inter-user variability in the resolution of spatial relations in natural language utterances. We show that our system can cope with this variability through personalisation, using an online learning-based retrieval formulation.
