An Overview of Cross-Media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges

Multimedia retrieval plays an indispensable role in big data utilization. Past efforts have mainly focused on single-media retrieval. However, user requirements are highly flexible, such as retrieving relevant audio clips with an image query. As a result, the challenges stemming from the “media gap,” which means that the representations of different media types are inconsistent, have attracted increasing attention. Cross-media retrieval is designed for scenarios where the queries and the retrieval results are of different media types. Because it is a relatively new research topic, its concepts, methodologies, and benchmarks are still not clear in the literature. To address these issues, we review more than 100 references, give an overview covering the concepts, methodologies, major challenges, and open issues, and build up benchmarks that include data sets and experimental results. Researchers can directly adopt the benchmarks to promptly evaluate their proposed methods, which will help them focus on algorithm design rather than on the time-consuming work of collecting compared methods and results. Notably, we have constructed a new data set, XMedia, which is the first publicly available data set with up to five media types (text, image, video, audio, and 3-D model). We believe this overview will attract more researchers to cross-media retrieval and be helpful to them.
