Cross-Media Similarity Evaluation for Web Image Retrieval in the Wild

To retrieve unlabeled images by textual queries, cross-media similarity computation is a key ingredient. Although novel methods are continuously introduced, little has been done to evaluate these methods together with large-scale query log analysis. Consequently, it is unclear how far these methods have brought us in answering real-user queries. It is also unclear how much progress advanced models have made over baseline methods that use relatively simple text/image matching. This paper takes a pragmatic approach to answering the two questions. Queries are automatically categorized according to the proposed query visualness measure, and this categorization is then connected to the evaluation of multiple cross-media similarity models on three test sets. Such a connection reveals that the success of the state of the art is mainly attributed to its good performance on visual-oriented queries, which account for only a small part of real-user queries. To quantify the current progress, we propose a simple text2image method, which represents a novel query by a set of images selected from a large-scale query log. Consequently, computing cross-media similarity between the query and a given image boils down to computing the visual similarity between the given image and the selected images. Image retrieval experiments on the challenging Clickture dataset show that the proposed text2image is a strong baseline, comparing favorably to recent deep learning alternatives.
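
The text2image idea described above can be illustrated with a minimal sketch. The abstract does not specify how images are selected from the query log or how they are weighted, so the details here are assumptions made for illustration only: logged queries are matched to the new query by Jaccard word overlap, the top `k` matches contribute their clicked images, and the final score is the match-weighted average of cosine similarities between image features. The names `jaccard`, `cosine`, `text2image`, `cross_media_similarity`, and the toy `LOG` are hypothetical.

```python
import math

def jaccard(a, b):
    """Word-overlap similarity between two query strings (illustrative choice)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def text2image(query, log, k=2):
    """Represent a query by (weight, image_feature) pairs taken from the
    k logged queries that overlap most with it (assumed selection rule)."""
    scored = [(jaccard(query, q), feat) for q, feat in log]
    scored = [(w, f) for w, f in scored if w > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def cross_media_similarity(query, image_feat, log, k=2):
    """Weighted average visual similarity between an image and the images
    selected to represent the query; 0.0 if nothing in the log matches."""
    selected = text2image(query, log, k)
    total = sum(w for w, _ in selected)
    if total == 0:
        return 0.0
    return sum(w * cosine(image_feat, f) for w, f in selected) / total

# Toy query log: (logged query, feature vector of a clicked image).
LOG = [
    ("red sports car", [1.0, 0.0]),
    ("sports car racing", [0.9, 0.1]),
    ("cute kitten", [0.0, 1.0]),
]

car_score = cross_media_similarity("fast sports car", [1.0, 0.0], LOG)
cat_score = cross_media_similarity("fast sports car", [0.0, 1.0], LOG)
print(car_score > cat_score)
```

In this toy setting, a car-like image scores higher than a kitten-like image for the query "fast sports car", because the query is represented only by images clicked for overlapping logged queries. A real system would use learned visual features and a far larger log.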
