Semantic context modeling with maximal margin Conditional Random Fields for automatic image annotation

Context modeling for visual recognition and Automatic Image Annotation (AIA) has attracted increasing attention in recent years. Among the various sources of contextual information, semantic context has been exploited in AIA with promising results. However, previous works either cast the problem as structural classification or adopted multi-layer modeling, and these approaches suffer from limited scalability or model efficiency. In this paper, we propose a novel discriminative Conditional Random Field (CRF) model for semantic context modeling in AIA. The model is built over semantic concepts and treats each image as a single observation, without segmentation. It captures the interactions between semantic concepts at both the semantic level and the visual level in an integrated manner. Specifically, we employ a graph structure to model contextual relationships between semantic concepts. The potential functions are based on linear discriminative models, which enables us to propose a novel decoupled hinge loss function for maximal margin parameter estimation. We train the model by solving a set of independent quadratic programming problems with our derived contextual kernel. Experiments are conducted on two commonly used benchmarks, the Corel and TRECVID data sets. The results show that our method achieves a significant improvement in annotation performance over state-of-the-art methods.
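The idea of decoupling the max-margin objective into one independent problem per concept can be illustrated with a minimal sketch. This is not the paper's formulation: it assumes labels in {-1, +1}, assumes the labels of neighboring concepts are observed during training, replaces the quadratic programs and the contextual kernel with plain subgradient descent on a linear hinge loss, and uses hypothetical names (`concept_score`, `train_concept`) and toy data throughout.

```python
import numpy as np

def concept_score(w_vis, w_ctx, x, y_neighbors):
    """Linear potential for one concept: visual term plus semantic-context term."""
    return float(w_vis @ x + w_ctx @ y_neighbors)

def train_concept(X, Y, c, neighbors, lam=0.01, lr=0.1, epochs=200):
    """Fit concept c on its own with an L2-regularized hinge loss.

    Assumption: each concept is trained independently of the others,
    conditioning on the observed labels of its graph neighbors.
    """
    n, d = X.shape
    w_vis = np.zeros(d)
    w_ctx = np.zeros(len(neighbors))
    for _ in range(epochs):
        # Start from the gradient of the regularizer.
        g_vis = lam * w_vis
        g_ctx = lam * w_ctx
        for i in range(n):
            ctx = Y[i, neighbors]
            margin = Y[i, c] * concept_score(w_vis, w_ctx, X[i], ctx)
            if margin < 1.0:  # hinge loss is active for this image
                g_vis = g_vis - Y[i, c] * X[i] / n
                g_ctx = g_ctx - Y[i, c] * ctx / n
        w_vis = w_vis - lr * g_vis
        w_ctx = w_ctx - lr * g_ctx
    return w_vis, w_ctx

# Toy data (hypothetical): 4 whole-image feature vectors, 2 concepts
# that always co-occur, so concept 1 is the only graph neighbor of concept 0.
X = np.array([[1.0, 0.0], [1.0, 0.2], [-1.0, 0.0], [-1.0, -0.2]])
Y = np.array([[1, 1], [1, 1], [-1, -1], [-1, -1]], dtype=float)
neighbors = [1]

w_vis, w_ctx = train_concept(X, Y, c=0, neighbors=neighbors)
preds = [
    1 if concept_score(w_vis, w_ctx, X[i], Y[i, neighbors]) > 0 else -1
    for i in range(len(X))
]
```

Because each call to `train_concept` touches only the weights of one concept, the per-concept problems in this sketch can be solved in any order or in parallel, which is the scalability property that a decoupled loss is meant to provide.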
