Multimodal Location Estimation of Consumer Media: Dealing with Sparse Training Data

This article describes a novel approach to the problem of associating geo-locations with consumer-produced multimedia data, such as videos and photos, that are publicly available on social networking websites such as Flickr. We focus on the case where the available training data is sparse, both in absolute numbers and in geographic coverage, relative to the amount of untagged query data. We develop a novel framework based on graphical models and pose geotagging as inference over the resulting graph. The novelty of our algorithm is that it jointly estimates the geo-locations of all the query videos, which improves performance over existing algorithms in the literature that process each query video independently. Our system lets the query videos act as "virtual" training data that effectively bootstrap the geotagging process, so the quality of the database improves with each additional query video in the system. Further, our model provides a generic theoretical framework that can incorporate any other available textual, visual, or audio features. We evaluate our algorithm on the MediaEval 2011 Placing Task data set and show that, for fixed training data, system performance improves as the amount of unlabeled test data increases. The performance gains are shown to be over 10% compared with existing algorithms in the literature.
