Inferring generic activities and events from image content and bags of geo-tags

The use of contextual information in building concept detectors for digital media has attracted the attention of the multimedia community in recent years. Generally speaking, any information extracted from image headers or tags, or from large collections of related images, and used at classification time can be considered contextual. Such information is discriminative in its own right and, when combined with purely content-based detection systems that use pixel information, can significantly improve overall recognition performance. In this paper, we describe a framework for probabilistically modeling geographical information, drawn from a Geographical Information Systems (GIS) database, for event and activity recognition in general-purpose consumer images such as those obtained from Flickr. The proposed framework discriminatively models the statistical saliency of geo-tags in describing an activity or event, leveraging the inherent patterns of association between events and their geographical venues. We use descriptions of small local neighborhoods to form bags of geo-tags as our representation; statistical coherence is observed in such descriptions across a wide range of event classes and across many different users. To test our approach, we identify classes of activities and events in which people commonly participate and take pictures, and obtain the corresponding images and metadata from Flickr. We employ the visual detectors from Columbia University (Columbia 374), which perform purely visual event and activity recognition. Our experiments demonstrate the performance advantage obtained by combining contextual GPS information with pixel-based detection systems.
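To make the pipeline concrete, below is a minimal Python sketch, not the authors' code: it substitutes a multinomial naive Bayes geo model for the paper's discriminative saliency model and a simple weighted sum for its fusion scheme, and every name in it (bag_of_geo_tags, GeoNaiveBayes, fuse, stub_gis) is hypothetical.

```python
# Illustrative sketch (not the authors' implementation): (1) form a bag of
# geo-tags from GIS place types near a photo's GPS fix, (2) score event
# classes with a multinomial naive Bayes geo model (a stand-in for the
# paper's discriminative saliency model), and (3) late-fuse the geo
# posterior with a visual detector score.
import math
from collections import Counter, defaultdict


def bag_of_geo_tags(lat, lon, gis_lookup, radius_m=200):
    """Bag (multiset) of place-type labels within radius_m of (lat, lon).

    gis_lookup stands in for a real GIS query: it maps a coordinate and a
    radius to a list of place-type strings such as 'beach' or 'stadium'.
    """
    return Counter(gis_lookup(lat, lon, radius_m))


class GeoNaiveBayes:
    """Multinomial naive Bayes over bags of geo-tags, one class per event."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing
        self.tag_counts = defaultdict(Counter)   # class -> tag -> count
        self.class_totals = Counter()            # class -> total tags seen
        self.class_photos = Counter()            # class -> training photos
        self.vocab = set()

    def fit(self, labeled_bags):
        """labeled_bags: iterable of (event_class, bag_of_geo_tags)."""
        for cls, bag in labeled_bags:
            self.class_photos[cls] += 1
            for tag, n in bag.items():
                self.tag_counts[cls][tag] += n
                self.class_totals[cls] += n
                self.vocab.add(tag)

    def log_posterior(self, bag):
        """Unnormalized log P(class | bag) for every trained class."""
        n_photos = sum(self.class_photos.values())
        v = len(self.vocab)
        scores = {}
        for cls in self.class_photos:
            lp = math.log(self.class_photos[cls] / n_photos)
            for tag, n in bag.items():
                p = (self.tag_counts[cls][tag] + self.alpha) \
                    / (self.class_totals[cls] + self.alpha * v)
                lp += n * math.log(p)
            scores[cls] = lp
        return scores


def fuse(visual_score, geo_log_posteriors, target_class, w=0.5):
    """Weighted late fusion of a visual detector score in [0, 1] with the
    softmax-normalized geo posterior for target_class."""
    m = max(geo_log_posteriors.values())
    z = sum(math.exp(s - m) for s in geo_log_posteriors.values())
    geo_prob = math.exp(geo_log_posteriors[target_class] - m) / z
    return w * visual_score + (1.0 - w) * geo_prob


# Toy usage with a stub GIS lookup (purely illustrative).
def stub_gis(lat, lon, radius_m):
    return ['beach', 'beach', 'pier', 'parking']


model = GeoNaiveBayes()
model.fit([('beach_party', Counter(beach=3, pier=1)),
           ('ski_trip', Counter(ski_slope=2, lodge=1))])
bag = bag_of_geo_tags(36.6, -121.9, stub_gis)
print(fuse(visual_score=0.4,
           geo_log_posteriors=model.log_posterior(bag),
           target_class='beach_party'))
```

In practice the fusion weight w would be tuned on held-out data; the paper's experiments evaluate exactly this kind of combination of a geo-context score with the purely visual (Columbia 374) detector output.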

[1] Jiebo Luo, et al. Large-scale multimodal semantic concept detection for consumer video, 2007, MIR '07.

[2] Dong Liu, et al. LORE: An infrastructure to support location-aware services, 2004, IBM J. Res. Dev.

[3] Ravi Kumar, et al. Visualizing tags over time, 2006, WWW '06.

[4] Mor Naaman, et al. Generating summaries and visualization for large collections of geo-referenced photographs, 2006, MIR '06.

[5] Nicu Sebe, et al. Content-based multimedia information retrieval: State of the art and challenges, 2006, TOMCCAP.

[6] Annika Hinze, et al. Locations- and Time-Based Information Delivery in Tourism, 2003, SSTD.

[7] James Ze Wang, et al. Image retrieval: Ideas, influences, and trends of the new age, 2008, CSUR.

[8] Thanasis Hadzilacos, et al. Advances in Spatial and Temporal Databases, 2015, Lecture Notes in Computer Science.

[9] John R. Smith, et al. On the detection of semantic concepts at TRECVID, 2004, MULTIMEDIA '04.

[10] Djemel Ziou, et al. Image Retrieval from the World Wide Web: Issues, Techniques, and Systems, 2004, CSUR.

[11] Daniel Gatica-Perez, et al. On image auto-annotation with latent space models, 2003, ACM Multimedia.

[12] Rong Yan, et al. Cross-domain video concept detection using adaptive SVMs, 2007, ACM Multimedia.

[13] David A. Forsyth, et al. Matching Words and Pictures, 2003, J. Mach. Learn. Res.

[14] Jiebo Luo, et al. Pictures are not taken in a vacuum - an overview of exploiting context for semantic scene content understanding, 2006, IEEE Signal Processing Magazine.

[15] James Ze Wang, et al. Real-Time Computerized Annotation of Pictures, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16] James Ze Wang, et al. Tagging over time: real-world image annotation by lightweight meta-learning, 2007, ACM Multimedia.

[17] Ouri Wolfson, et al. Extracting Semantic Location from Outdoor Positioning Systems, 2006, 7th International Conference on Mobile Data Management (MDM'06).

[18] Mor Naaman, et al. Why we tag: motivations for annotation in mobile and online media, 2007, CHI.

[19] Jiebo Luo, et al. Kodak consumer video benchmark data set: concept definition and annotation, 2008.

[20] Mor Naaman, et al. How flickr helps us make sense of the world: context and content in community-contributed media collections, 2007, ACM Multimedia.

[21] Shih-Fu Chang, et al. Columbia University’s Baseline Detectors for 374 LSCOM Semantic Visual Concepts, 2007.

[22] Milind R. Naphade, et al. Semantics reinforcement and fusion learning for multimedia streams, 2007, CIVR '07.

[23] Ron Sivan, et al. Web-a-where: geotagging web content, 2004, SIGIR '04.