Online Asymmetric Similarity Learning for Cross-Modal Retrieval

Cross-modal retrieval has attracted intensive attention in recent years. Measuring the semantic similarity between heterogeneous data objects is an essential yet challenging problem in cross-modal retrieval. In this paper, we propose an online learning method to learn the similarity function between heterogeneous modalities by preserving the relative similarity in the training data, which is modeled as a set of bi-directional hinge loss constraints on the cross-modal training triplets. The overall online similarity function learning problem is optimized by the margin based Passive-Aggressive algorithm. We further extend the approach to learn similarity function in reproducing kernel Hilbert spaces by kernelizing the approach and combining multiple kernels derived from different layers of the CNN features using the Hedging algorithm. Theoretical mistake bounds are given for our methods. Experiments conducted on real world datasets well demonstrate the effectiveness of our methods.

[1]  Qingming Huang,et al.  Hedged Deep Tracking , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[3]  Jian Wang,et al.  Cross-Modal Retrieval via Deep and Bidirectional Representation Learning , 2016, IEEE Transactions on Multimedia.

[4]  Samy Bengio,et al.  A Discriminative Kernel-Based Approach to Rank Images from Text Queries , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  C. V. Jawahar,et al.  Multi-label Cross-Modal Retrieval , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Koby Crammer,et al.  Online Passive-Aggressive Algorithms , 2003, J. Mach. Learn. Res..

[7]  Rong Jin,et al.  Online Multiple Kernel Similarity Learning for Visual Search , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[9]  Qi Tian,et al.  Multi-feature metric learning with knowledge transfer among semantics and social tagging , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Changsheng Xu,et al.  Learning Consistent Feature Representation for Cross-Modal Multimedia Retrieval , 2015, IEEE Transactions on Multimedia.

[11]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Roman Rosipal,et al.  Overview and Recent Advances in Partial Least Squares , 2005, SLSFS.

[13]  Yueting Zhuang,et al.  Cross-media semantic representation via bi-directional learning to rank , 2013, ACM Multimedia.

[14]  Josef Kittler,et al.  Multiple Kernel Learning via Distance Metric Learning for Interactive Image Retrieval , 2011, MCS.

[15]  Qingming Huang,et al.  Cross-Modal Correlation Learning by Adaptive Hierarchical Semantic Aggregation , 2014, IEEE Transactions on Multimedia.

[16]  Chong-Wah Ngo,et al.  Learning Query and Image Similarities with Ranking Canonical Correlation Analysis , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[18]  Inderjit S. Dhillon,et al.  Information-theoretic metric learning , 2006, ICML '07.

[19]  Jeff A. Bilmes,et al.  Deep Canonical Correlation Analysis , 2013, ICML.

[20]  Thorsten Joachims,et al.  Learning a Distance Metric from Relative Comparisons , 2003, NIPS.

[21]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[22]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[23]  Rong Jin,et al.  Online Multiple Kernel Learning: Algorithms and Mistake Bounds , 2010, ALT.

[24]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[25]  Kristen Grauman,et al.  Accounting for the Relative Importance of Objects in Image Retrieval , 2010, BMVC.

[26]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[27]  Samy Bengio,et al.  Large Scale Online Learning of Image Similarity Through Ranking , 2009, J. Mach. Learn. Res..

[28]  David W. Jacobs,et al.  Generalized Multiview Analysis: A discriminative latent space , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Yoram Singer,et al.  Online and batch learning of pseudo-metrics , 2004, ICML.

[30]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[31]  Inderjit S. Dhillon,et al.  Metric and Kernel Learning Using a Linear Transformation , 2009, J. Mach. Learn. Res..

[32]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Jun Wang,et al.  Metric Learning with Multiple Kernels , 2011, NIPS.

[34]  Amir Globerson,et al.  Metric Learning by Collapsing Classes , 2005, NIPS.

[35]  Roger Levy,et al.  A new approach to cross-modal multimedia retrieval , 2010, ACM Multimedia.

[36]  Zhanghui Kuang,et al.  Relatively-Paired Space Analysis: Learning a Latent Common Space From Relatively-Paired Observations , 2015, International Journal of Computer Vision.

[37]  Krystian Mikolajczyk,et al.  Deep correlation for matching images and text , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Yanjun Qi,et al.  Learning to rank with (a lot of) word features , 2010, Information Retrieval.

[39]  Michael I. Jordan,et al.  Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.

[40]  Inderjit S. Dhillon,et al.  Online Metric Learning and Fast Similarity Search , 2008, NIPS.

[41]  Wei Wang,et al.  Learning Coupled Feature Spaces for Cross-Modal Matching , 2013, 2013 IEEE International Conference on Computer Vision.