Click-through-based Deep Visual-Semantic Embedding for Image Search

The problem of image search has mostly been studied from the perspectives of feature-based vector models and image ranker learning. A fundamental issue underlying the success of these approaches is learning the similarity between a query and an image. The reliance on image-surrounding texts in feature-based vector models, however, makes the similarity sensitive to the quality of the text descriptions. On the other hand, image ranker learning can suffer from a robustness problem, originating from the fact that human-labeled query-image pairs do not always capture user search intention precisely. We demonstrate in this paper that both issues can be mitigated by jointly exploring visual-semantic embedding and click-through data. Specifically, we propose a novel click-through-based deep visual-semantic embedding (C-DVSE) model for learning query-image similarity. The proposed model consists of two components: a deep convolutional neural network followed by an image embedding layer for learning the visual embedding, and a deep neural network for generating the query semantic embedding. The objective of our model is to maximize the correlation between the semantic (query) and visual (clicked image) embeddings. Once the visual-semantic embedding is learned, query-image similarity can be computed directly as the cosine similarity in the embedding space. On a large-scale click-based image dataset with 11.7 million queries and one million images, our model is shown to be powerful for keyword-based image search, with superior performance over several state-of-the-art methods.
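To make the two-branch architecture and training objective concrete, the following is a minimal PyTorch sketch of the design described above. The layer sizes, the AlexNet backbone, the bag-of-words query representation, and the correlation-style loss are illustrative assumptions, not the authors' exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class CDVSE(nn.Module):
    """Sketch of a click-through-based deep visual-semantic embedding model."""

    def __init__(self, vocab_size=50000, embed_dim=256):
        super().__init__()
        # Visual branch: a deep CNN followed by an image embedding layer.
        # AlexNet is assumed here as a stand-in for the paper's CNN.
        cnn = models.alexnet(weights=None)
        self.cnn = cnn.features
        self.pool = nn.AdaptiveAvgPool2d((6, 6))
        self.img_embed = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, embed_dim),
        )
        # Semantic branch: a deep neural network over the query,
        # assumed here to be a bag-of-words vector.
        self.query_net = nn.Sequential(
            nn.Linear(vocab_size, 1024),
            nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, images, queries):
        v = self.img_embed(self.pool(self.cnn(images)))  # visual embedding
        s = self.query_net(queries)                      # semantic embedding
        # L2-normalize so that dot products equal cosine similarities.
        return F.normalize(v, dim=1), F.normalize(s, dim=1)

def correlation_loss(v, s):
    # Maximize the cosine similarity between each query embedding and the
    # embedding of its clicked image (i.e., minimize the negative mean).
    return -(v * s).sum(dim=1).mean()
```

At retrieval time, image embeddings can be precomputed offline; a keyword query is embedded once and images are ranked by cosine similarity, which for the L2-normalized vectors above reduces to a dot product.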
