Paper Information - Deep Visual-Semantic Hashing for Cross-Modal Retrieval

Owing to its storage and retrieval efficiency, hashing has been widely applied to approximate nearest neighbor search in large-scale multimedia retrieval. Cross-modal hashing, which enables efficient retrieval of images in response to text queries and vice versa, has received increasing attention recently. Most existing work on cross-modal hashing does not capture the spatial dependency of images or the temporal dynamics of text sentences, both of which are needed to learn powerful feature representations and cross-modal embeddings that mitigate the heterogeneity of different modalities. This paper presents a new Deep Visual-Semantic Hashing (DVSH) model that generates compact hash codes of images and sentences in an end-to-end deep learning architecture; these codes capture the intrinsic cross-modal correspondences between visual data and natural language. DVSH is a hybrid deep architecture that comprises a visual-semantic fusion network for learning a joint embedding space of images and text sentences, and two modality-specific hashing networks for learning hash functions that generate compact binary codes. Our architecture unifies joint multimodal embedding and cross-modal hashing. It is based on a novel combination of Convolutional Neural Networks over images, Recurrent Neural Networks over sentences, and a structured max-margin objective that ties these components together to enable learning of similarity-preserving, high-quality hash codes. Extensive empirical evidence shows that our DVSH approach yields state-of-the-art results in cross-modal retrieval experiments on image-sentence datasets, namely the standard IAPR TC-12 and the large-scale Microsoft COCO.
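The retrieval side of cross-modal hashing can be illustrated with a minimal sketch. It assumes that an image network and a sentence network have already produced real-valued embeddings in a shared space; the toy vectors, the function names, and the margin value below are all hypothetical and are not taken from the paper. The sketch binarizes embeddings with the sign function, ranks database items by Hamming distance, and shows a generic cosine hinge loss as one simple example of a max-margin criterion. It is not the paper's exact objective or architecture.

```python
from math import sqrt

def to_hash_code(embedding):
    """Binarize a real-valued embedding with the sign function and
    pack the resulting bits into a single Python int."""
    code = 0
    for value in embedding:
        code = (code << 1) | (1 if value >= 0 else 0)
    return code

def hamming_distance(code_a, code_b):
    """Count the bit positions in which two packed codes differ."""
    return bin(code_a ^ code_b).count("1")

def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted by Hamming distance to the query."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def pairwise_margin_loss(u, v, similar, margin=0.5):
    """Generic hinge loss on cosine similarity (an illustrative assumption):
    similar pairs are pushed above +margin, dissimilar pairs below -margin."""
    target = 1.0 if similar else -1.0
    return max(0.0, margin - target * cosine(u, v))

# Hypothetical 4-dimensional embeddings standing in for network outputs.
image_embeddings = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.8, -0.1, 0.3],
    [0.7, -0.6, -0.3, -0.9],
]
text_query = [0.8, -0.1, 0.5, -0.6]

image_codes = [to_hash_code(e) for e in image_embeddings]
query_code = to_hash_code(text_query)
ranking = rank_by_hamming(query_code, image_codes)
print(ranking)  # indices of images, closest code first
```

Packing the bits into an integer keeps the comparison to one XOR and a bit count per database item, which is the source of the storage and retrieval efficiency the abstract refers to. In a trained model the embeddings would come from the learned networks rather than from hand-written lists.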
