Modal-aware Features for Multimodal Hashing

Many retrieval applications can benefit from multiple modalities, e.g., Wikipedia articles that pair text with images, and how to represent such multimodal data is the critical component. Most deep multimodal learning methods construct joint representations in two steps: 1) learning multiple intermediate features, one per modality, using separate and independent deep models; 2) merging the intermediate features into a joint representation using a fusion strategy. However, in the first step these intermediate features have no prior knowledge of each other and cannot fully exploit the information contained in the other modalities. In this paper, we present a modal-aware operation as a generic building block that captures the non-linear dependencies among the heterogeneous intermediate features, so that the underlying correlation structures across modalities can be learned as early as possible. The modal-aware operation consists of a kernel network and an attention network. The kernel network learns the non-linear relationships with the other modalities. Then, to learn better representations for binary hash codes, the attention network locates the informative regions of these modal-aware features that are favorable for retrieval. Experiments conducted on three public benchmark datasets demonstrate significant improvements of our method over state-of-the-art methods.
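The abstract describes the modal-aware operation only at a high level. The NumPy sketch below illustrates one plausible reading of it: a kernel network non-linearly mixes the two per-modality intermediate features, and an attention network gates the result before binarization into hash codes. The layer sizes, the concatenation-based mixing, and the sigmoid gate are all assumptions for illustration, not the authors' exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modal_aware_block(x_img, x_txt, params):
    """Hypothetical modal-aware operation: a kernel network captures
    non-linear cross-modal dependencies between the intermediate
    features, then an attention network emphasizes the dimensions
    that are informative for retrieval."""
    # Kernel network: non-linear mixing of both modalities' features.
    joint = np.concatenate([x_img, x_txt], axis=-1)
    h = relu(joint @ params["W_k"] + params["b_k"])
    # Attention network: soft gate over the modal-aware feature.
    a = sigmoid(h @ params["W_a"] + params["b_a"])
    return a * h  # attended modal-aware feature

# Toy dimensions (assumed, for illustration only).
d_img, d_txt, d_joint = 8, 6, 10
params = {
    "W_k": rng.standard_normal((d_img + d_txt, d_joint)) * 0.1,
    "b_k": np.zeros(d_joint),
    "W_a": rng.standard_normal((d_joint, d_joint)) * 0.1,
    "b_a": np.zeros(d_joint),
}
feat = modal_aware_block(rng.standard_normal(d_img),
                         rng.standard_normal(d_txt), params)
# Binarize into hash codes with the usual sign function.
code = np.sign(feat - feat.mean())
```

In a trained model the parameters would be learned end-to-end with a hashing loss; here random weights merely demonstrate the data flow from two modality features to one attended joint feature and its binary code.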