ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Marcella Cornia | L. Baraldi | R. Cucchiara | G. Amato | F. Falchi | Nicola Messina | Matteo Stefanini
[1] Marcella Cornia,et al. CaMEL: Mean Teacher Learning for Image Captioning , 2022, 2022 26th International Conference on Pattern Recognition (ICPR).
[2] Tejas Gokhale,et al. Weakly Supervised Relative Spatial Reasoning for Visual Question Answering , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[3] Rita Cucchiara,et al. From Show to Tell: A Survey on Deep Learning-Based Image Captioning , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[4] Yejin Choi,et al. VinVL: Revisiting Visual Representations in Vision-Language Models , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Claudio Gennaro,et al. Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features , 2021, 2021 International Conference on Content-Based Multimedia Indexing (CBMI).
[6] Julien Mairal,et al. Emerging Properties in Self-Supervised Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[7] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[8] Quoc V. Le,et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.
[9] Qingrong Cheng,et al. Learning Dual Semantic Relations With Graph Attention for Image-Text Matching , 2020, IEEE Transactions on Circuits and Systems for Video Technology.
[10] Liqiang Nie,et al. Context-Aware Multi-View Summarization Network for Image-Text Matching , 2020, ACM Multimedia.
[11] Andrea Esuli,et al. Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders , 2020, ACM Trans. Multim. Comput. Commun. Appl.
[12] H. R. Tavakoli,et al. A unified cycle-consistent neural model for text and image retrieval , 2020, Multimedia Tools and Applications.
[13] Radoslaw Bialobrzeski,et al. Context-Aware Learning to Rank with Self-Attention , 2020, ArXiv.
[14] Andrea Esuli,et al. Transformer Reasoning Network for Image-Text Matching and Retrieval , 2020, 2020 25th International Conference on Pattern Recognition (ICPR).
[15] Jianfeng Gao,et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.
[16] Marcella Cornia,et al. A Novel Attention-based Aggregation Function to Combine Vision and Language , 2020, 2020 25th International Conference on Pattern Recognition (ICPR).
[17] Hanwang Zhang,et al. More Grounded Image Captioning by Distilling Image-Text Matching Model , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Qi Wu,et al. Image and Sentence Matching via Semantic Concepts and Order Learning , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[19] Lin Su,et al. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data , 2020, ArXiv.
[20] Marcella Cornia,et al. Meshed-Memory Transformer for Image Captioning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Marcus Rohrbach,et al. 12-in-1: Multi-Task Vision and Language Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Sebastian Bruch,et al. An Alternative Cross Entropy Loss for Learning-to-Rank , 2019, WWW.
[23] Quoc V. Le,et al. Self-Training With Noisy Student Improves ImageNet Classification , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Omer Levy,et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.
[25] Qingming Huang,et al. Learning Fragment Self-Attention Embeddings for Image-Text Matching , 2019, ACM Multimedia.
[26] Yu Cheng,et al. UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.
[27] Jason J. Corso,et al. Unified Vision-Language Pre-Training for Image Captioning and VQA , 2019, AAAI.
[28] Jianfeng Gao,et al. Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators , 2019, ArXiv.
[29] Yun Fu,et al. Visual Semantic Reasoning for Image-Text Matching , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[30] Ioannis A. Kakadiaris,et al. Adversarial Representation Learning for Text-to-Image Matching , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[31] Furu Wei,et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.
[32] Nan Duan,et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training , 2019, AAAI.
[33] Stefan Lee,et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.
[34] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[35] Kaisheng Ma,et al. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[36] Geoffrey E. Hinton,et al. Large scale distributed neural network training through online distillation , 2018, ICLR.
[37] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[38] David J. Fleet,et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.
[39] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[40] Tie-Yan Liu,et al. Learning to rank: from pairwise approach to listwise approach , 2007, ICML '07.
[41] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.