Deep Top-$k$ Ranking for Image–Sentence Matching

Image–sentence matching is a challenging task for the heterogeneity-gap between different modalities. Ranking-based methods have achieved excellent performance in this task in past decades. Given an image query, these methods typically assume that the correct matched image–sentence pair must rank before all other mismatched ones. However, this assumption may be too strict and prone to the overfitting problem, especially when some sentences in a massive database are similar and confusable with one another. In this paper, we relax the traditional ranking loss and propose a novel deep multi-modal network with a top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math></inline-formula> ranking loss to mitigate the data ambiguity problem. With this strategy, query results will not be penalized unless the index of ground truth is outside the range of top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math></inline-formula> query results. Considering the non-smoothness and non-convexity of the initial top-<inline-formula><tex-math notation="LaTeX">$k$</tex-math></inline-formula> ranking loss, we exploit a tight convex upper bound to approximate the loss and then utilize the traditional back-propagation algorithm to optimize the deep multi-modal network. Finally, we apply the method on three benchmark datasets, namely, Flickr8k, Flickr30k, and MSCOCO. Empirical results on metrics R@K (K = 1, 5, 10) show that our method achieves comparable performance in comparison to state-of-the-art methods.

[1]  Lin Ma,et al.  Multimodal Convolutional Neural Networks for Matching Image and Sentence , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Mathias Lux,et al.  Cross Media Retrieval in Knowledge Discovery , 2004, PAKM.

[3]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Liwei Wang,et al.  Learning Two-Branch Neural Networks for Image-Text Matching Tasks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Lior Wolf,et al.  Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation , 2014, ArXiv.

[6]  L. Montanarella,et al.  Chemometric classification of some european wines using pyrolysis mass spectrometry , 1995 .

[7]  Dean P. Foster,et al.  Finding Linear Structure in Large Datasets with Scalable Canonical Correlation Analysis , 2015, ICML.

[8]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[9]  Colin Fyfe,et al.  Kernel and Nonlinear Canonical Correlation Analysis , 2000, IJCNN.

[10]  Yueting Zhuang,et al.  A low rank structural large margin method for cross-modal ranking , 2013, SIGIR.

[11]  Michael K. Ng,et al.  Sparse Kernel Canonical Correlation Analysis via $\ell_1$-regularization , 2017, 1701.04207.

[12]  Yuxin Peng,et al.  CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network , 2017, IEEE Transactions on Multimedia.

[13]  Bernt Schiele,et al.  Top-k Multiclass SVM , 2015, NIPS.

[14]  Shian-Shyong Tseng,et al.  Knee Point Search Using Cascading Top-k Sorting with Minimized Time Complexity , 2013, TheScientificWorldJournal.

[15]  Qingming Huang,et al.  Cross-Modal Correlation Learning by Adaptive Hierarchical Semantic Aggregation , 2014, IEEE Transactions on Multimedia.

[16]  Jason Weston,et al.  WSABIE: Scaling Up to Large Vocabulary Image Annotation , 2011, IJCAI.

[17]  Qinghua Zheng,et al.  Avoiding Optimal Mean ℓ2,1-Norm Maximization-Based Robust PCA for Reconstruction , 2017, Neural Computation.

[18]  Peter Young,et al.  From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[19]  Zhihui Li,et al.  Top-k multi-class SVM using multiple features , 2017, Inf. Sci..

[20]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Bernt Schiele,et al.  Loss Functions for Top-k Error: Analysis and Insights , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Ruslan Salakhutdinov,et al.  Multimodal Neural Language Models , 2014, ICML.

[23]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[24]  C. V. Jawahar,et al.  Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval , 2014, BMVC.

[25]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yongfeng Huang,et al.  Twitter100k: A Real-World Dataset for Weakly Supervised Cross-Media Retrieval , 2017, IEEE Transactions on Multimedia.

[27]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  A. Murat Tekalp,et al.  Audiovisual Synchronization and Fusion Using Canonical Correlation Analysis , 2007, IEEE Transactions on Multimedia.

[29]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[30]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[31]  Frank Rudzicz,et al.  Adaptive Kernel Canonical Correlation Analysis for Estimation of Task Dynamics from Acoustics , 2010, ICASSP.

[32]  Michael Isard,et al.  A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics , 2012, International Journal of Computer Vision.

[33]  Aviv Eisenschtat,et al.  Linking Image and Text with 2-Way Nets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Jiwen Lu,et al.  Deep Coupled Metric Learning for Cross-Modal Matching , 2017, IEEE Transactions on Multimedia.

[35]  Krystian Mikolajczyk,et al.  Deep correlation for matching images and text , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Thomas G. Dietterich,et al.  Transductive Optimization of Top k Precision , 2015, IJCAI.

[37]  Qinghua Zheng,et al.  Simple to Complex Cross-modal Learning to Rank , 2017, Comput. Vis. Image Underst..

[38]  Yueting Zhuang,et al.  Deep Compositional Cross-modal Learning to Rank via Local-Global Alignment , 2015, ACM Multimedia.

[39]  Dong Yu,et al.  Deep Learning: Methods and Applications , 2014, Found. Trends Signal Process..

[40]  Raman Arora,et al.  Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[41]  Armand Joulin,et al.  Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[42]  Quoc V. Le,et al.  Grounded Compositional Semantics for Finding and Describing Images with Sentences , 2014, TACL.

[43]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[44]  Yi Yang,et al.  Robust Top-k Multiclass SVM for Visual Category Recognition , 2017, KDD.

[45]  Qinghua Zheng,et al.  Avoiding Optimal Mean Robust PCA/2DPCA with Non-greedy ℓ1-Norm Maximization , 2016, IJCAI.

[46]  Martin A. Riedmiller,et al.  Advanced supervised learning in multi-layer perceptrons — From backpropagation to adaptive learning algorithms , 1994 .

[47]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[48]  Maya R. Gupta,et al.  Training highly multiclass classifiers , 2014, J. Mach. Learn. Res..

[49]  Svetlana Lazebnik,et al.  Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections , 2014, ECCV.

[50]  Jeff A. Bilmes,et al.  Deep Canonical Correlation Analysis , 2013, ICML.

[51]  Wei Wang,et al.  Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[53]  Qi Tian,et al.  Cross-Modal Retrieval Using Multiordered Discriminative Structured Subspace Learning , 2017, IEEE Transactions on Multimedia.

[54]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..