Learning the Best Pooling Strategy for Visual Semantic Embedding
暂无分享,去创建一个
[1] Yale Song,et al. Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Giorgos Tolias,et al. Fine-Tuning CNN Image Retrieval with No Human Annotation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[3] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[4] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.
[5] Wei-Lun Chao,et al. Synthesized Classifiers for Zero-Shot Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Samy Bengio,et al. Zero-Shot Learning by Convex Combination of Semantic Embeddings , 2013, ICLR.
[8] Jung-Woo Ha,et al. Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Jonathon S. Hare,et al. FSPool: Learning Set Representations with Featurewise Sort Pooling , 2019, ICLR.
[10] Christoph H. Lampert,et al. Attribute-Based Classification for Zero-Shot Visual Object Categorization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[11] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[12] Jason Baldridge,et al. Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO , 2020, EACL.
[13] Jeff Johnson,et al. Billion-Scale Similarity Search with GPUs , 2017, IEEE Transactions on Big Data.
[14] David J. Fleet,et al. VSE++: Improved Visual-Semantic Embeddings , 2017, ArXiv.
[15] Wei-Ying Ma,et al. Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Ahmed El Kholy,et al. UNITER: Learning UNiversal Image-TExt Representations , 2019, ECCV 2020.
[17] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[18] Jonatas Wehrmann,et al. Language-Agnostic Visual-Semantic Embeddings , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[19] Xinlei Chen,et al. In Defense of Grid Features for Visual Question Answering , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[20] Ying Zhang,et al. Consensus-Aware Visual-Semantic Embedding for Image-Text Matching , 2020, ECCV.
[21] Gang Wang,et al. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[22] Qi Zhang,et al. Context-Aware Attention Network for Image-Text Retrieval , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[24] Xin Wang,et al. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[25] Yan Huang,et al. ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[26] Shizhe Chen,et al. Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Zhuowen Tu,et al. Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Xi Chen,et al. Stacked Cross Attention for Image-Text Matching , 2018, ECCV.
[29] Fei Sha,et al. Learning to Represent Image and Text with Denotation Graphs , 2020, EMNLP.
[30] Marc Van Droogenbroeck,et al. Ordinal Pooling , 2019, BMVC.
[31] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.
[32] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[33] Kaiming He,et al. Exploring the Limits of Weakly Supervised Pretraining , 2018, ECCV.
[34] Quoc V. Le,et al. Grounded Compositional Semantics for Finding and Describing Images with Sentences , 2014, TACL.
[35] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.
[36] Nan Duan,et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training , 2019, AAAI.
[37] Yun Fu,et al. Visual Semantic Reasoning for Image-Text Matching , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[38] Yan Huang,et al. Learning Semantic Concepts and Order for Image and Sentence Matching , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[39] Ruslan Salakhutdinov,et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.
[40] Stefan Lee,et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.
[41] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.
[42] Ji Liu,et al. IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[43] Phil Blunsom,et al. A Convolutional Neural Network for Modelling Sentences , 2014, ACL.
[44] Jian Sun,et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[45] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[46] Hexiang Hu,et al. Binary Image Selection (BISON): Interpretable Evaluation of Visual Grounding , 2019, ArXiv.
[47] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Marc'Aurelio Ranzato,et al. DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.
[49] Cho-Jui Hsieh,et al. VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.
[50] Jianfeng Gao,et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.
[51] Aviv Eisenschtat,et al. Linking Image and Text with 2-Way Nets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Radu Soricut,et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.