Zhe Gan | Shuohang Wang | Michael Zeng | Mohit Bansal | Zicheng Liu | Chenguang Zhu | Linjie Li | Lijuan Wang | Yixin Nie
[1] Kurt Hornik, et al. Multilayer feedforward networks are universal approximators, 1989, Neural Networks.
[2] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[3] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.
[4] Svetlana Lazebnik, et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[5] Margaret Mitchell, et al. VQA: Visual Question Answering, 2015, International Journal of Computer Vision.
[6] Christopher Potts, et al. A large annotated corpus for learning natural language inference, 2015, EMNLP.
[7] Kevin Gimpel, et al. Gaussian Error Linear Units (GELUs), 2016.
[8] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.
[10] Michael S. Bernstein, et al. Visual7W: Grounded Question Answering in Images, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[12] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Asim Kadav, et al. Visual Entailment Task for Visually-Grounded Language Learning, 2018, ArXiv.
[15] Lei Zhang, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[16] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[17] Christopher D. Manning, et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Hongyuan Zha, et al. A Fast Proximal Point Method for Computing Exact Wasserstein Distance, 2018, UAI.
[19] Asim Kadav, et al. Visual Entailment: A Novel Task for Fine-Grained Image Understanding, 2019, ArXiv.
[20] Cho-Jui Hsieh, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language, 2019, ArXiv.
[21] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.
[22] Xinlei Chen, et al. Cycle-Consistency for Robust Visual Question Answering, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.
[24] Yoav Artzi, et al. A Corpus for Reasoning about Natural Language Grounded in Photographs, 2018, ACL.
[25] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[26] Chitta Baral, et al. VQA-LOL: Visual Question Answering under the Lens of Logic, 2020, European Conference on Computer Vision.
[27] Furu Wei, et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, 2019, ICLR.
[28] Yu Cheng, et al. Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models, 2020, ECCV.
[29] Nan Duan, et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, 2019, AAAI.
[30] Zhe Gan, et al. A Closer Look at the Robustness of Vision-and-Language Pre-trained Models, 2020, ArXiv.
[31] Licheng Yu, et al. Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training, 2020, EMNLP.
[32] Jianlong Fu, et al. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, 2020, ArXiv.
[33] Jason J. Corso, et al. Unified Vision-Language Pre-Training for Image Captioning and VQA, 2019, AAAI.
[34] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[35] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.
[36] J. Weston, et al. Adversarial NLI: A New Benchmark for Natural Language Understanding, 2019, ACL.
[37] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.
[38] Luke Melas-Kyriazi, et al. Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet, 2021, ArXiv.
[39] Wonjae Kim, et al. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, 2021, ICML.
[40] Masato Taki, et al. RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?, 2021, ArXiv.
[41] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[42] A. Dosovitskiy, et al. MLP-Mixer: An all-MLP Architecture for Vision, 2021, NeurIPS.
[43] Jianlong Fu, et al. Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[44] Wenhu Chen, et al. Meta Module Network for Compositional Visual Reasoning, 2019, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).
[45] Jaesung Tae, et al. MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis, 2021, 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP).
[46] Matthieu Cord, et al. ResMLP: Feedforward Networks for Image Classification With Data-Efficient Training, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[47] Kai Han, et al. Hire-MLP: Vision MLP via Hierarchical Rearrangement, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[49] Zhe Gan, et al. VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation, 2021, NeurIPS Datasets and Benchmarks.
[50] Shenghua Gao, et al. AS-MLP: An Axial Shifted MLP Architecture for Vision, 2021, ICLR.
[51] Zhiyi Ma, et al. Dynabench: Rethinking Benchmarking in NLP, 2021, NAACL.
[52] Yi Tay, et al. Synthesizer: Rethinking Self-Attention for Transformer Models, 2020, ICML.
[53] Jean-Baptiste Alayrac, et al. Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, 2021, Transactions of the Association for Computational Linguistics.
[54] Kurt Keutzer, et al. How Much Can CLIP Benefit Vision-and-Language Tasks?, 2021, ICLR.
[55] Christian Wolf, et al. Roses are Red, Violets are Blue… But Should VQA expect Them To?, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[56] Douwe Kiela, et al. Human-Adversarial Visual Question Answering, 2021, NeurIPS.
[57] Junnan Li, et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, 2021, NeurIPS.
[58] Zhe Gan, et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[59] Lei Zhang, et al. VinVL: Making Visual Representations Matter in Vision-Language Models, 2021, ArXiv.
[60] Zhe Gan, et al. Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[61] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[62] Alexander Kolesnikov, et al. Scaling Vision Transformers, 2021, ArXiv.
[63] Quoc V. Le, et al. Pay Attention to MLPs, 2021, NeurIPS.
[64] Ping Luo, et al. CycleMLP: A MLP-like Architecture for Dense Prediction, 2021, ICLR.
[65] Haitao Zheng, et al. Are we ready for a new paradigm shift? A survey on visual deep MLP, 2021, Patterns.
[66] Shuicheng Yan, et al. Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[67] Yunfeng Cai, et al. S2-MLP: Spatial-Shift MLP Architecture for Vision, 2021, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).