Cross-Media Retrieval by Multimodal Representation Fusion with Deep Networks

With the rapid development of computer networks, multimedia, and digital transmission technologies, information dissemination has shifted from a mainly text-based form to multimedia forms including text, images, video, and audio. To meet users' growing demand for access to multimedia information, cross-media retrieval has become a key problem for both research and application. Given a query of any media type, cross-media retrieval returns semantically relevant results across all media types. Measuring similarity between different media types requires learning a good shared representation for multimedia data. Existing methods mainly extract a single-modality representation for each media type and then learn cross-media correlations under a pairwise similarity constraint alone; as a result, they cannot make full use of the rich information within each media type, and they ignore the dissimilarity constraints between different media types. To address these problems, this paper proposes a deep multimodal learning method (DML) for cross-media shared representation learning. First, we adopt two different deep networks for each media type with multimodal learning, which yields a high-level semantic representation of each individual medium. Then, a two-pathway network is constructed that jointly models the pairwise similarity and dissimilarity constraints with a contrastive loss to obtain the shared representation. Experiments on two widely used cross-media datasets demonstrate the effectiveness of the proposed method.
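The paper itself does not include code, but the second stage described above, a two-pathway network trained with a contrastive loss over similar and dissimilar cross-media pairs, can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the authors' configuration: the class and function names, the layer sizes, the input feature dimensions, and the margin value are all placeholders, and the modality-specific deep networks of the first stage are abstracted as simple feed-forward branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoPathwayNet(nn.Module):
    """Toy two-pathway network: one branch per media type maps
    modality-specific features into a common shared embedding space.
    Dimensions are placeholders, not the paper's settings."""

    def __init__(self, img_dim=4096, txt_dim=1000, shared_dim=128):
        super().__init__()
        self.img_branch = nn.Sequential(
            nn.Linear(img_dim, 512), nn.ReLU(), nn.Linear(512, shared_dim))
        self.txt_branch = nn.Sequential(
            nn.Linear(txt_dim, 512), nn.ReLU(), nn.Linear(512, shared_dim))

    def forward(self, img_feat, txt_feat):
        return self.img_branch(img_feat), self.txt_branch(txt_feat)


def contrastive_loss(img_emb, txt_emb, label, margin=1.0):
    """Standard contrastive loss: pull similar pairs (label=1) together,
    push dissimilar pairs (label=0) at least `margin` apart."""
    dist = F.pairwise_distance(img_emb, txt_emb)
    pos = label * dist.pow(2)                       # similarity constraint
    neg = (1 - label) * F.relu(margin - dist).pow(2)  # dissimilarity constraint
    return (pos + neg).mean()


# Toy usage: random features for 8 image-text pairs, half matching.
net = TwoPathwayNet()
img = torch.randn(8, 4096)
txt = torch.randn(8, 1000)
y = torch.tensor([1., 1., 1., 1., 0., 0., 0., 0.])
loss = contrastive_loss(*net(img, txt), y)
loss.backward()
```

At retrieval time, a query of either media type would be mapped into the shared space by its branch and candidates of the other type ranked by their distance to it, which is how jointly modeling both constraints supports cross-media similarity measurement.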
