Paper information - Deep Learning Generic Features for Cross-Media Retrieval - 字舞流文

Deep Learning Generic Features for Cross-Media Retrieval

Cross-media retrieval is an important approach to handling the explosive growth of multimodal data on the web. However, effectively uncovering the correlations between multimodal data remains a barrier to successful cross-media retrieval. Traditional approaches learn the connection between modalities by directly using hand-crafted, low-level heterogeneous features, and the learned correlations are merely constructed in terms of high-level feature representation. To fully exploit the intrinsic structures of multimodal data, it is essential to build an interpretable correlation between the modalities. In this paper, we propose a deep model that learns a high-level feature representation shared by multiple modalities for cross-media retrieval. We learn the discriminative high-level feature representation in a data-driven manner before faithfully encoding the multimodal correlations. We train our deep model on large-scale multimodal data crawled from the Internet and evaluate its effectiveness for cross-media retrieval on the NUS-WIDE dataset. The experimental results show that the proposed model outperforms other state-of-the-art approaches.

Tat-Seng Chua | Hanwang Zhang | Xindi Shang
