Structured max-margin learning for multi-label image annotation

In this paper, a structured max-margin learning scheme is developed to achieve more effective training of a large number of inter-related classifiers for multi-label image annotation. First, a visual concept network is constructed to characterize the inter-concept visual similarity contexts more precisely and determine the inter-related learning tasks automatically. Second, multiple base kernels are combined to achieve more precise characterization of the diverse visual similarity contexts between the images and address the issue of huge intra-concept visual diversity more effectively. Third, a structured max-margin learning algorithm is developed by incorporating the visual concept network, max-margin Markov networks and multi-task learning to address the issue of huge inter-concept visual similarity more effectively. Our structured max-margin learning algorithm can leverage the inter-concept visual similarity contexts to learn a large number of inter-related classifiers simultaneously and improve their discrimination power significantly. Our experiments have also obtained very positive results.

[1]  Nenghai Yu,et al.  Flickr distance , 2008, ACM Multimedia.

[2]  Antonio Torralba,et al.  Nonparametric scene parsing: Label transfer via dense scene alignment , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Antonio Torralba,et al.  Learning hierarchical models of scenes, objects, and parts , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[4]  Jitendra Malik,et al.  Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[5]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[6]  Manik Varma,et al.  Learning The Discriminative Power-Invariance Trade-Off , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[7]  Jianping Fan,et al.  Integrating Concept Ontology and Multitask Learning to Achieve More Effective Classifier Training for Multilevel Image Annotation , 2008, IEEE Transactions on Image Processing.

[8]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[9]  Charles A. Micchelli,et al.  Learning Multiple Tasks with Kernel Methods , 2005, J. Mach. Learn. Res..

[10]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[11]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[12]  Guido Smits,et al.  Improved SVM regression using mixtures of kernels , 2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290).

[13]  Antonio Torralba,et al.  LabelMe: A Database and Web-Based Tool for Image Annotation , 2008, International Journal of Computer Vision.

[14]  Tao Mei,et al.  Correlative multi-label video annotation , 2007, ACM Multimedia.

[15]  Alexander G. Hauptmann,et al.  Discriminative Fields for Modeling Semantic Concepts in Video , 2007, RIAO.

[16]  Nuno Vasconcelos,et al.  Image indexing with mixture hierarchies , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[17]  Andrea Vedaldi,et al.  Objects in Context , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[18]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[19]  Andrew Zisserman,et al.  Representing shape with a spatial pyramid kernel , 2007, CIVR '07.

[20]  David A. Forsyth,et al.  Learning the semantics of words and pictures , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[21]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[22]  Antonio Torralba,et al.  Sharing features: efficient boosting procedures for multiclass object detection , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[23]  Shih-Fu Chang,et al.  Context-Based Concept Fusion with Boosted Conditional Random Fields , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[24]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[25]  Eric P. Xing,et al.  Harmonium Models for Semantic Video Representation and Classification , 2007, SDM.

[26]  Antonio Torralba,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition , 2022 .