Deep correspondence restricted Boltzmann machine for cross-modal retrieval

The task of cross-modal retrieval, i.e., using a text query to search for images or vice versa, has received considerable attention with the rapid growth of multi-modal web data. Modeling the correlations between different modalities is the key to tackling this problem. In this paper, we propose a correspondence restricted Boltzmann machine (Corr-RBM) to map the original features of bimodal data, such as image and text in our setting, into a low-dimensional common space in which the heterogeneous data are comparable. In our Corr-RBM, two RBMs, built for image and text respectively, are connected at their individual hidden representation layers by a correlation loss function. A single objective function is constructed to trade off the correlation loss against the likelihoods of both modalities. By optimizing this objective function, our Corr-RBM captures the correlations between the two modalities and learns the representation of each modality simultaneously. Furthermore, we construct two deep neural structures using Corr-RBM as the main building block for the task of cross-modal retrieval. A number of comparative experiments are performed on three public real-world data sets. All of our models show significantly better results than state-of-the-art models, both in searching images via text queries and vice versa.
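To make the idea of a common space concrete, here is a minimal sketch of retrieval with two modality-specific encoders. It is not the authors' implementation: the abstract does not state the form of the correlation loss, so the squared Euclidean distance used here is an assumption, as are all function names, layer sizes, and the randomly initialised (untrained) weights that stand in for learned parameters. Training, including the likelihood terms of the objective, is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_activations(visible, weights, hidden_bias):
    """Mean-field hidden activations of one RBM: sigmoid(v W + c).

    `weights` has one row per visible unit and one column per hidden unit.
    """
    return [
        sigmoid(
            sum(v * row[j] for v, row in zip(visible, weights)) + hidden_bias[j]
        )
        for j in range(len(hidden_bias))
    ]


def correlation_loss(h_image, h_text):
    """Assumed loss: squared Euclidean distance between the two hidden codes."""
    return sum((a - b) ** 2 for a, b in zip(h_image, h_text))


def rank_images(text_code, image_codes):
    """Return image indices ordered from closest to farthest from the query."""
    return sorted(
        range(len(image_codes)),
        key=lambda i: correlation_loss(image_codes[i], text_code),
    )


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_image_features, n_text_features, n_hidden = 6, 4, 3  # hypothetical sizes

# Placeholder parameters; in a trained model these would be learned jointly.
w_image = random_matrix(n_image_features, n_hidden, rng)
w_text = random_matrix(n_text_features, n_hidden, rng)
c_image = [0.0] * n_hidden
c_text = [0.0] * n_hidden

# Toy data: five "images" and one "text query" with different feature sizes.
images = [[rng.random() for _ in range(n_image_features)] for _ in range(5)]
text_query = [rng.random() for _ in range(n_text_features)]

# Both modalities are mapped into the same n_hidden-dimensional space.
image_codes = [hidden_activations(v, w_image, c_image) for v in images]
text_code = hidden_activations(text_query, w_text, c_text)

ranking = rank_images(text_code, image_codes)
print(ranking)
```

Because both encoders produce codes of the same length, a text query can be compared directly with every image, which is the property the common space is meant to provide. With untrained weights the ranking itself is meaningless; the sketch only illustrates the data flow.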

Ruifan Li | Xiaojie Wang | Fangxiang Feng