Learning Click-Based Deep Structure-Preserving Embeddings with Visual Attention

A fundamental problem in image search is learning the ranking function, i.e., the similarity between a query and an image. Recent progress on this topic has followed two paradigms: text-based models and image ranker learning. The former relies on the text surrounding an image, making the similarity sensitive to the quality of textual descriptions. The latter can lack robustness when human-labeled query-image pairs do not represent user search intent precisely. We demonstrate in this article that both limitations can be mitigated by learning a cross-view embedding that leverages click data. Specifically, a novel click-based Deep Structure-Preserving Embeddings with visual Attention (DSPEA) model is presented, which consists of two components: a deep convolutional neural network followed by image embedding layers that learn the visual embedding, and a deep neural network that generates the query semantic embedding. Visual attention is incorporated on top of the convolutional network to highlight the image regions relevant to the query. Furthermore, to cope with the high dimensionality of the query space, a new click-based representation over a query set is proposed to alleviate the resulting sparsity. The whole network is trained end to end by optimizing a large-margin objective that combines cross-view ranking constraints with in-view neighborhood structure preservation constraints. On a large-scale click-based image dataset with 11.7 million queries and 1 million images, our model outperforms several state-of-the-art methods on keyword-based image search and achieves, to date, the best reported NDCG@25 of 52.21%.
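To make the training objective concrete, the following is a minimal PyTorch sketch of a large-margin loss that combines cross-view ranking constraints with in-view neighborhood-preservation constraints, in the spirit of the objective described above. The function names, the margin value, the weighting factor `lam`, and the triplet-sampling scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hinge(anchor, pos, neg, margin):
    """max(0, margin + d(anchor, pos) - d(anchor, neg)), averaged over the batch."""
    return F.relu(margin
                  + F.pairwise_distance(anchor, pos)
                  - F.pairwise_distance(anchor, neg)).mean()

def dspea_style_loss(q, v_click, v_nonclick,   # query / clicked / non-clicked image embeddings
                     q_nbr, q_nonnbr,          # neighbor / non-neighbor queries (in-view)
                     v_nbr, v_nonnbr,          # neighbor / non-neighbor images (in-view)
                     margin=0.1, lam=0.2):
    # Cross-view ranking constraint: a clicked image should embed closer
    # to its query than a non-clicked image does.
    cross = hinge(q, v_click, v_nonclick, margin)
    # In-view structure preservation: points that are neighbors in each
    # original space should stay closer than non-neighbors in the shared space.
    in_view = (hinge(q, q_nbr, q_nonnbr, margin)
               + hinge(v_click, v_nbr, v_nonnbr, margin))
    return cross + lam * in_view

if __name__ == "__main__":
    # Smoke test with random L2-normalized embeddings (batch of 8, 64-dim).
    f = lambda: F.normalize(torch.randn(8, 64), dim=1)
    print(dspea_style_loss(f(), f(), f(), f(), f(), f(), f()).item())
```

Weighting the in-view terms by a small `lam` reflects the intuition that the cross-view ranking signal from clicks is primary, while neighborhood preservation acts as a structural regularizer on the shared embedding space.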
