Learning query and image similarities with listwise supervision

One of the fundamental problems in image search is learning the ranking function, i.e., the similarity between a textual query and a visual image. A number of research paradigms, ranging from feature-based vector models to image ranker learning, have been applied to measure query-image similarity. However, most existing similarity learning methods either depend on surrounding text for ranking images or learn image rankers to satisfy pairwise or tripletwise supervision. In this paper, we propose to leverage listwise supervision in a principled click-through-based query-image similarity learning framework. In particular, the algorithm utilizes the click counts of each image in response to a query to form a ranking list. The ranking information is represented by a set of rank triplets that can be used to assess the quality of the ranking list. The image ranking problem is then solved efficiently by learning two linear projections, for the query and image spaces respectively, that maximize ranking quality over all the training data. Once the two linear projections are learnt, query-image similarity can be computed directly as a dot product in the projected subspace. On a large-scale click-through-based image dataset with 11.7 million queries and one million images, our model learnt via listwise supervision proves powerful for keyword-based image search, with superior performance over several state-of-the-art methods.
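The two core operations the abstract describes, forming rank triplets from per-image click counts and scoring query-image pairs by a dot product in a learnt subspace, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the random stand-in projections, and the function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only): raw query and image feature
# sizes, and the shared latent subspace both sides are projected into.
d_query, d_image, d_latent = 50, 100, 10

# The two linear projections the framework learns; random stand-ins here.
W_q = rng.standard_normal((d_latent, d_query))
W_v = rng.standard_normal((d_latent, d_image))

def similarity(q, v):
    """Query-image similarity as a dot product in the projected subspace."""
    return float((W_q @ q) @ (W_v @ v))

def rank_triplets(query_id, click_counts):
    """Turn per-image click counts for one query into rank triplets
    (query, higher-clicked image index, lower-clicked image index)."""
    triplets = []
    for j, cj in enumerate(click_counts):
        for k, ck in enumerate(click_counts):
            if cj > ck:
                triplets.append((query_id, j, k))
    return triplets

# A query with three clicked images: counts 5, 2, 0 yield three triplets.
print(rank_triplets("sunset", [5, 2, 0]))
# -> [('sunset', 0, 1), ('sunset', 0, 2), ('sunset', 1, 2)]
```

A learnt model would choose W_q and W_v so that, for as many triplets as possible, the higher-clicked image scores above the lower-clicked one under similarity().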
