Semi-supervised cross-modal learning for cross modal retrieval and image annotation

Multimedia data are usually associated with multiple modalities represented by heterogeneous features. Many information retrieval tasks are no longer restricted to a single modality, and content-based cross-modal retrieval has become a popular research field. The premise of cross-modal retrieval is discovering the relationships between different modalities efficiently. Although some approaches have been proposed to address this challenging problem, they either ignore the valuable labels or depend heavily on completely labeled training data. In addition, for features with relatively high dimensionality, it is important to select the most informative ones. In this paper, we propose a semi-supervised algorithm for cross-modal learning. Our algorithm makes full use of both a small amount of labeled data and abundant unlabeled data to establish connections between modalities by discovering a shared semantic space. In addition, our algorithm automatically filters out noisy and redundant features to further improve the model. Finally, we give an efficient solution to the objective function. Experiments on two publicly available datasets demonstrate that the proposed method is competitive with or even superior to its state-of-the-art counterparts.
