Multi-level network based on transformer encoder for fine-grained image–text matching

[1]  Joemon M. Jose, et al. Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval, 2021, ACM Multimedia.

[2]  Liqiang Nie, et al. Dynamic Modality Interaction Modeling for Image-Text Retrieval, 2021, SIGIR.

[3]  Houqiang Li, et al. Deep Relation Embedding for Cross-Modal Retrieval, 2020, IEEE Transactions on Image Processing.

[4]  Liqiang Nie, et al. Context-Aware Multi-View Summarization Network for Image-Text Matching, 2020, ACM Multimedia.

[5]  Andrea Esuli, et al. Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders, 2020, ACM Trans. Multim. Comput. Commun. Appl.

[6]  Yuxin Peng, et al. MAVA: Multi-Level Adaptive Visual-Textual Alignment by Cross-Media Bi-Attention Mechanism, 2019, IEEE Transactions on Image Processing.

[7]  Qingming Huang, et al. Learning Fragment Self-Attention Embeddings for Image-Text Matching, 2019, ACM Multimedia.

[8]  Yi Li, et al. Learning discriminative representations for semantical crossmodal retrieval, 2018, Multimedia Systems.

[9]  Qingming Huang, et al. Learning Semantic Structure-preserved Embeddings for Cross-modal Retrieval, 2018, ACM Multimedia.

[10]  Amit K. Roy-Chowdhury, et al. Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval, 2018, ACM Multimedia.

[11]  Qi Tian, et al. Multi-Networks Joint Learning for Large-Scale Cross-Modal Retrieval, 2017, ACM Multimedia.

[12]  Yang Yang, et al. Adversarial Cross-Modal Retrieval, 2017, ACM Multimedia.

[13]  Yuxin Peng, et al. CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning, 2017, ArXiv.

[14]  Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Ruifan Li, et al. Cross-modal Retrieval with Correspondence Autoencoder, 2014, ACM Multimedia.

[16]  Xinhang Song, et al. Relative image similarity learning with contextual information for Internet cross-media retrieval, 2014, Multimedia Systems.

[17]  A. Yazıcı, et al. RELIEF-MM: effective modality weighting for multimedia information retrieval, 2014, Multimedia Systems.

[18]  Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.

[19]  Roger Levy, et al. A new approach to cross-modal multimedia retrieval, 2010, ACM Multimedia.

[20]  John Shawe-Taylor, et al. Canonical Correlation Analysis: An Overview with Application to Learning Methods, 2004, Neural Computation.

[21]  Li Fei-Fei, et al. Deep visual-semantic alignments for generating image descriptions, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).