Deep correspondence restricted Boltzmann machine for cross-modal retrieval

The task of cross-modal retrieval, i.e., using a text query to search for images or vice versa, has received considerable attention with the rapid growth of multi-modal web data. Modeling the correlations between different modalities is the key to tackle this problem. In this paper, we propose a correspondence restricted Boltzmann machine (Corr-RBM) to map the original features of bimodal data, such as image and text in our setting, into a low-dimensional common space, in which the heterogeneous data are comparable. In our Corr-RBM, two RBMs built for image and text, respectively are connected at their individual hidden representation layers by a correlation loss function. A single objective function is constructed to trade off the correlation loss and likelihoods of both modalities. Through the optimization of this objective function, our Corr-RBM is able to capture the correlations between two modalities and learn the representation of each modality simultaneously. Furthermore, we construct two deep neural structures using Corr-RBM as the main building block for the task of cross-modal retrieval. A number of comparison experiments are performed on three public real-world data sets. All of our models show significantly better results than state-of-the-art models in both searching images via text query and vice versa.

[1]  Roger Levy,et al.  A new approach to cross-modal multimedia retrieval , 2010, ACM Multimedia.

[2]  Jürgen Schmidhuber,et al.  Multimodal Similarity-Preserving Hashing , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Gert R. G. Lanckriet,et al.  Partial order embedding with multiple kernels , 2009, ICML '09.

[4]  Hui Li,et al.  Improving mixing rate with tempered transition for learning restricted Boltzmann machines , 2014, Neurocomputing.

[5]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[6]  Pascal Vincent,et al.  Parallel Tempering for Training of Restricted Boltzmann Machines , 2010 .

[7]  TorralbaAntonio,et al.  Modeling the Shape of the Scene , 2001 .

[8]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[9]  Paul Smolensky,et al.  Information processing in dynamical systems: foundations of harmony theory , 1986 .

[10]  B. S. Manjunath,et al.  Color and texture descriptors , 2001, IEEE Trans. Circuits Syst. Video Technol..

[11]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[12]  Geoffrey E. Hinton,et al.  Using fast weights to improve persistent contrastive divergence , 2009, ICML '09.

[13]  Nitish Srivastava,et al.  Learning Representations for Multimodal Data with Deep Belief Nets , 2012 .

[14]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[15]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[16]  Geoffrey E. Hinton,et al.  Restricted Boltzmann machines for collaborative filtering , 2007, ICML '07.

[17]  Nitish Srivastava,et al.  Modeling Documents with Deep Boltzmann Machines , 2013, UAI.

[18]  Nikos Paragios,et al.  Data fusion through cross-modality metric learning using similarity-sensitive hashing , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[19]  Beng Chin Ooi,et al.  Effective Multi-Modal Retrieval based on Stacked Auto-Encoders , 2014, Proc. VLDB Endow..

[20]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[21]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL.

[22]  Raghavendra Udupa,et al.  Learning Hash Functions for Cross-View Similarity Search , 2011, IJCAI.

[23]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[24]  Pascal Vincent,et al.  Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines , 2010, AISTATS.

[25]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[26]  Tijmen Tieleman,et al.  Training restricted Boltzmann machines using approximations to the likelihood gradient , 2008, ICML '08.

[27]  Iryna Gurevych,et al.  Learning Semantics with Deep Belief Network for Cross-Language Information Retrieval , 2012, COLING.

[28]  Geoffrey E. Hinton,et al.  Exponential Family Harmoniums with an Application to Information Retrieval , 2004, NIPS.

[29]  Ruifan Li,et al.  Cross-modal Retrieval with Correspondence Autoencoder , 2014, ACM Multimedia.

[30]  Geoffrey E. Hinton,et al.  Replicated Softmax: an Undirected Topic Model , 2009, NIPS.

[31]  Geoffrey E. Hinton,et al.  Factored conditional restricted Boltzmann Machines for modeling motion style , 2009, ICML '09.

[32]  Bart Thomee,et al.  New trends and ideas in visual concept detection: the MIR flickr retrieval evaluation initiative , 2010, MIR '10.

[33]  Zi Huang,et al.  Linear cross-modal hashing for efficient multimedia search , 2013, ACM Multimedia.

[34]  Andrew Zisserman,et al.  Image Classification using Random Forests and Ferns , 2007, 2007 IEEE 11th International Conference on Computer Vision.