Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval

Multimedia devices and technologies have developed rapidly in recent years. Retrieving interesting and highly relevant information from massive multimedia data has become an urgent and challenging problem. To obtain more accurate retrieval results, researchers naturally turn to more fine-grained features to evaluate the similarity among multimedia samples. In this paper, we propose a Deep Attentional Fine-grained Similarity Network (DAFSN) for cross-modal retrieval, which is optimized in an adversarial learning manner. The DAFSN model consists of two subnetworks: an attentional fine-grained similarity network for aligned representation learning and a modal discriminative network. The former adopts a bi-directional Long Short-Term Memory (Bi-LSTM) network and a pre-trained Inception-v3 model to extract text features and image features, respectively. In aligned representation learning, we consider not only the sentence-level pair-matching constraint but also the fine-grained similarity between the word-level features of a text description and the sub-regional features of an image. The modal discriminative network aims to minimize the “heterogeneity gap” between text features and image features in an adversarial manner. We conduct experiments on several widely used datasets to verify the performance of the proposed DAFSN. The experimental results show that DAFSN obtains better retrieval results as measured by the mean average precision (MAP) metric. In addition, result analyses and visual comparisons are presented in the experimental section.
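
To make the idea of word-to-region fine-grained similarity concrete, the following is a minimal sketch of one common attention-style way to score a sentence against an image. It is not the DAFSN formulation, which the abstract does not spell out. The function names, the `temperature` parameter, and the toy two-dimensional vectors are all hypothetical and chosen only for illustration. Each word attends over the image sub-regions, the attended regions are pooled into a context vector, and the sentence-level score is the average cosine similarity between each word and its context.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fine_grained_similarity(words, regions, temperature=5.0):
    """Score a text against an image from word-level and region-level features.

    words:   list of word feature vectors (e.g. Bi-LSTM hidden states)
    regions: list of sub-region feature vectors (e.g. CNN feature-map cells)
    Both are assumed to be already projected into the same dimension.
    `temperature` (an illustrative assumption) sharpens the attention.
    """
    per_word_scores = []
    for word in words:
        # Attention weights of this word over all image sub-regions.
        weights = softmax([temperature * cosine(word, r) for r in regions])
        # Region context vector: attention-weighted sum of region features.
        context = [
            sum(w * r[d] for w, r in zip(weights, regions))
            for d in range(len(word))
        ]
        # How well the attended visual context matches the word.
        per_word_scores.append(cosine(word, context))
    # Aggregate word-level relevance into a single text-image score.
    return sum(per_word_scores) / len(per_word_scores)


# Toy example: two "words" and two candidate images with two regions each.
words = [[1.0, 0.0], [0.0, 1.0]]
matching_regions = [[1.0, 0.0], [0.0, 1.0]]
mismatched_regions = [[-1.0, 0.0], [0.0, -1.0]]

print(fine_grained_similarity(words, matching_regions))
print(fine_grained_similarity(words, mismatched_regions))
```

In this toy setting the image whose regions align with the words receives the higher score. In a trained model such a score would typically be combined with a sentence-level matching term, as the abstract describes, rather than used on its own.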
