Image Annotation Within the Context of Personal Photo Collections Using Hierarchical Event and Scene Models

Most image annotation systems consider a single photo at a time and label photos individually. In this work, we focus on collections of personal photos and exploit the contextual information naturally implied by the associated GPS and time metadata. First, we employ a constrained clustering method to partition a photo collection into event-based subcollections, considering that the GPS records may be partly missing (a practical issue). We then use conditional random field (CRF) models to exploit the correlation between photos based on 1) time-location constraints and 2) the relationship between collection-level annotation (i.e., events) and image-level annotation (i.e., scenes). With the introduction of such a multilevel annotation hierarchy, our system addresses the problem of annotating consumer photo collections that requires a more hierarchical description of the customers' activities than do the simpler image annotation tasks. The efficacy of the proposed system is validated by extensive evaluation using a sizable geotagged personal photo collection database, which consists of over 100 photo collections and is manually labeled for 12 events and 12 scenes to create ground truth.

[1]  Jiebo Luo,et al.  Inferring generic activities and events from image content and bags of geo-tags , 2008, CIVR '08.

[2]  M. Opper,et al.  Comparing the Mean Field Method and Belief Propagation for Approximate Inference in MRFs , 2001 .

[3]  Roberto Cipolla,et al.  Improved Image Annotation and Labelling through Multi-Label Boosting , 2005, BMVC.

[4]  Rong Yan,et al.  Mining Relationship Between Video Concepts using Probabilistic Graphical Models , 2006, 2006 IEEE International Conference on Multimedia and Expo.

[5]  Alberto Del Bimbo,et al.  Taking into Consideration Sports Semantic Annotation of Sports Videos Content-based Multimedia Indexing and Retrieval , 2002 .

[6]  John R. Smith,et al.  IBM Research TRECVID-2009 Video Retrieval System , 2009, TRECVID.

[7]  James Ze Wang,et al.  SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[8]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[9]  Luc Van Gool,et al.  Modeling scenes with local descriptors and latent aspects , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[10]  Yi Wu,et al.  Sampling Strategies for Active Learning in Personal Photo Retrieval , 2006, 2006 IEEE International Conference on Multimedia and Expo.

[11]  H. Garcia-Molina,et al.  Automatic organization for digital photographs with geographic coordinates , 2004, Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, 2004..

[12]  Alexei A. Efros,et al.  Discovering objects and their location in images , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[13]  Steven M. Seitz,et al.  Scene Summarization for Online Image Collections , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[14]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[15]  Mor Naaman,et al.  Leveraging geo-referenced digital photographs , 2005 .

[16]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[17]  Alexei Yavlinsky,et al.  An online system for gathering image similarity judgements , 2007, ACM Multimedia.

[18]  Martial Hebert,et al.  Discriminative Random Fields , 2006, International Journal of Computer Vision.

[19]  Joo-Hwee Lim,et al.  Home Photo Content Modeling for Personalized Event-Based Retrieval , 2003, IEEE Multim..

[20]  Martial Hebert,et al.  A Comparison of Image Segmentation Algorithms , 2005 .

[21]  Fei-Fei Li,et al.  What, where and who? Classifying events by scene and object recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[22]  Andreas Girgensohn,et al.  Temporal event clustering for digital photo collections , 2003, ACM Multimedia.

[23]  Antonio Torralba,et al.  LabelMe: A Database and Web-Based Tool for Image Annotation , 2008, International Journal of Computer Vision.

[24]  Jiebo Luo,et al.  Pictures are not taken in a vacuum - an overview of exploiting context for semantic scene content understanding , 2006, IEEE Signal Processing Magazine.

[25]  Mubarak Shah,et al.  Improving Semantic Concept Detection and Retrieval using Contextual Estimates , 2007, 2007 IEEE International Conference on Multimedia and Expo.

[26]  Shih-Fu Chang,et al.  Kernel Sharing With Joint Boosting For Multi-Class Concept Detection , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Mor Naaman,et al.  Generating summaries and visualization for large collections of geo-referenced photographs , 2006, MIR '06.

[28]  Shih-Fu Chang,et al.  Image Retrieval: Current Techniques, Promising Directions, and Open Issues , 1999, J. Vis. Commun. Image Represent..

[29]  Andrew McCallum,et al.  Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data , 2004, J. Mach. Learn. Res..

[30]  Dieter Fox,et al.  Location-Based Activity Recognition , 2005, KI.

[31]  M. Cooper Collective Media Annotation using Undirected Random Field Models , 2007 .

[32]  Paul A. Viola,et al.  Boosting Image Retrieval , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[33]  Mor Naaman,et al.  How flickr helps us make sense of the world: context and content in community-contributed media collections , 2007, ACM Multimedia.

[34]  Alexei A. Efros,et al.  Discovering object categories in image collections , 2005 .

[35]  Kurt Konolige,et al.  Real-time Localization in Outdoor Environments using Stereo Vision and Inexpensive GPS , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[36]  Wei-Ying Ma,et al.  Benchmarking of image features for content-based retrieval , 1998, Conference Record of Thirty-Second Asilomar Conference on Signals, Systems and Computers (Cat. No.98CH36284).

[37]  Lihi Zelnik-Manor,et al.  Event-based analysis of video , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[38]  Jiebo Luo,et al.  Mining compositional features for boosting , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Dorin Comaniciu,et al.  Mean Shift: A Robust Approach Toward Feature Space Analysis , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[40]  Ara V. Nefian,et al.  Learning Concept Templates from Web Images to Query Personal Image Databases , 2007, 2007 IEEE International Conference on Multimedia and Expo.

[41]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[42]  Andreas E. Savakis,et al.  Automated event clustering and quality screening of consumer pictures for digital albuming , 2003, IEEE Trans. Multim..

[43]  Jiebo Luo,et al.  Learning multi-label scene classification , 2004, Pattern Recognit..

[44]  David A. Forsyth,et al.  Matching Words and Pictures , 2003, J. Mach. Learn. Res..

[45]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[46]  Jitendra Malik,et al.  A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.