Paper Information - Learning Transferable Visual Models From Natural Language Supervision

Learning Transferable Visual Models From Natural Language Supervision

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.
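
The abstract describes the pre-training task (predicting which caption goes with which image) and zero-shot transfer only at a high level, and does not state the loss function. The following is a minimal sketch, assuming a symmetric cross-entropy over temperature-scaled cosine similarities. The function names, the temperature value of 0.07, and the use of plain Python lists in place of real image and text encoder outputs are all illustrative assumptions, not details taken from the text above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Return logits[i][j] = cos(image_i, text_j) / temperature.

    The temperature value is an assumed hyperparameter for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]


def cross_entropy(logits_row, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits_row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits_row))
    return log_sum - logits_row[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image i should select caption i, and caption i image i."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    n = len(logits)
    # Rows score each image against every caption in the batch.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Columns score each caption against every image in the batch.
    columns = [list(col) for col in zip(*logits)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class description most similar to the image."""
    logits = similarity_matrix([image_emb], class_text_embs)[0]
    return max(range(len(logits)), key=lambda j: logits[j])
```

In this sketch, zero-shot transfer needs no extra training: each candidate class is described in text, embedded, and compared with the image embedding, so a new label set can be used simply by supplying new descriptions.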
