Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching
Heng Tao Shen | Alan Hanjalic | Yang Yang | Xing Xu | Yifan Wang | Yixuan He
[1] Zhoujun Li,et al. Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching , 2019, IEEE Transactions on Image Processing.
[2] Yi Yang,et al. Modality-Invariant Image-Text Embedding for Image-Sentence Matching , 2019, ACM Trans. Multim. Comput. Commun. Appl.
[3] Yale Song,et al. Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[5] Xuelong Li,et al. Stacked squeeze-and-excitation recurrent residual network for visual-semantic matching , 2020, Pattern Recognit.
[6] Liqiang Nie,et al. Neural Multimodal Cooperative Learning Toward Micro-Video Understanding , 2020, IEEE Transactions on Image Processing.
[7] Xiaogang Wang,et al. Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Ioannis A. Kakadiaris,et al. Adversarial Representation Learning for Text-to-Image Matching , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[9] Trevor Darrell,et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.
[10] Qing Li,et al. Learning Shared Semantic Space with Correlation Alignment for Cross-Modal Event Retrieval , 2019, ACM Trans. Multim. Comput. Commun. Appl.
[11] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.
[12] Qi Tian,et al. Joint Global and Co-Attentive Representation Learning for Image-Sentence Retrieval , 2018, ACM Multimedia.
[13] Gabriela Csurka,et al. Semantic combination of textual and visual information in multimedia retrieval , 2011, ICMR.
[14] Wei-Ying Ma,et al. Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Peng Gao,et al. Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Yang Gao,et al. Compact Bilinear Pooling , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Geoffrey E. Hinton,et al. Visualizing Data using t-SNE , 2008.
[18] Liwei Wang,et al. Learning Two-Branch Neural Networks for Image-Text Matching Tasks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[19] Meng Wang,et al. Towards Micro-video Understanding by Joint Sequential-Sparse Modeling , 2017, ACM Multimedia.
[20] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[21] Qi Wu,et al. Image and Sentence Matching via Semantic Concepts and Order Learning , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[22] Xi Chen,et al. Stacked Cross Attention for Image-Text Matching , 2018, ECCV.
[23] Yao Zhao,et al. Cross-Modal Retrieval With CNN Visual Features: A New Baseline , 2017, IEEE Transactions on Cybernetics.
[24] Erik Cambria,et al. Tensor Fusion Network for Multimodal Sentiment Analysis , 2017, EMNLP.
[25] Jianhai Zhang,et al. Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling , 2019, NeurIPS.
[26] Yan Huang,et al. Learning Semantic Concepts and Order for Image and Sentence Matching , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[27] Huchuan Lu,et al. Deep Cross-Modal Projection Learning for Image-Text Matching , 2018, ECCV.
[28] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[29] E. Miller,et al. Top-Down Versus Bottom-Up Control of Attention in the Prefrontal and Posterior Parietal Cortices , 2007, Science.
[30] Ying Zhang,et al. Consensus-Aware Visual-Semantic Embedding for Image-Text Matching , 2020, ECCV.
[31] Yoshua Bengio,et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.
[32] Léon Bottou,et al. Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.
[33] Jie Chen,et al. Attention on Attention for Image Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[34] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.
[35] Yuxin Peng,et al. CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning , 2021.
[36] Yongdong Zhang,et al. Multi-Modality Cross Attention Network for Image and Sentence Matching , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[37] Jian Yang,et al. Occluded Pedestrian Detection Through Guided Attention in CNNs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[38] Ji Liu,et al. IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Yu Liu,et al. Learning a Recurrent Residual Fusion Network for Multimodal Matching , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[40] Rui Cao,et al. Information fusion in visual question answering: A Survey , 2019, Inf. Fusion.
[41] Lei Zhang,et al. Image Captioning with a Joint Attention Mechanism by Visual Concept Samples , 2020, ACM Trans. Multim. Comput. Commun. Appl.
[42] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[43] Yin Li,et al. Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[44] Huimin Lu,et al. Ternary Adversarial Networks With Self-Supervision for Zero-Shot Cross-Modal Retrieval , 2020, IEEE Transactions on Cybernetics.
[45] Xin Xu,et al. Cross-Modality Retrieval by Joint Correlation Learning , 2019, ACM Trans. Multim. Comput. Commun. Appl.
[46] Marc'Aurelio Ranzato,et al. DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.
[47] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[48] David J. Fleet,et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.
[49] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[50] Xiaogang Wang,et al. CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[51] Yuxin Peng,et al. Sequential Cross-Modal Hashing Learning via Multi-scale Correlation Mining , 2020, ACM Trans. Multim. Comput. Commun. Appl.
[52] Cédric Westphal,et al. Scalable Routing Via Greedy Embedding , 2009, IEEE INFOCOM 2009.
[53] Yorick Wilks,et al. A Closer Look at Skip-gram Modelling , 2006, LREC.
[54] Matthieu Cord,et al. MUTAN: Multimodal Tucker Fusion for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[55] Yoshua Bengio,et al. Fine-grained attention mechanism for neural machine translation , 2018, Neurocomputing.
[56] Wei Chen,et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.
[57] Heng Tao Shen,et al. Joint Feature Synthesis and Embedding: Adversarial Cross-Modal Retrieval Revisited , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[58] Xin Wang,et al. Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning , 2018, NAACL.
[59] Wei Wang,et al. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[60] Xilin Chen,et al. Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).
[61] Yang Yang,et al. Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking , 2019, ACM Multimedia.
[62] Matthieu Cord,et al. BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection , 2019, AAAI.
[63] Xiaogang Wang,et al. Identity-Aware Textual-Visual Matching with Latent Co-attention , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[64] Louis-Philippe Morency,et al. Efficient Low-rank Multimodal Fusion With Modality-Specific Factors , 2018, ACL.