Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image Pre-training) and Flamingo, but little is known about the dataset creation processes behind them. In this work, we introduce a testbed of six publicly available data sources (YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock) to investigate how pre-training distributions induce robustness in CLIP. We find that the robustness conferred by the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting in which combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall, our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.
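To make the CLIP-based filtering step concrete, below is a minimal sketch in the spirit of the LAION pipeline: an image-caption pair is retained only when the cosine similarity between its CLIP image and text embeddings clears a threshold (LAION-400M used 0.3 with the OpenAI ViT-B/32 model). The sketch assumes the open_clip library; the clip_score and keep_pair helpers are illustrative names and not part of the paper's released code.

```python
# A minimal sketch of CLIP-score filtering as used to build LAION-400M:
# keep an image-text pair only if its CLIP cosine similarity exceeds a threshold.
# Assumes the open_clip library; 0.3 is the threshold reported for LAION-400M.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    # Normalize so the dot product equals the cosine similarity.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

def keep_pair(image_path: str, caption: str, threshold: float = 0.3) -> bool:
    """Filter rule: retain the pair only if its CLIP score clears the threshold."""
    return clip_score(image_path, caption) >= threshold
```

Scoring with cosine similarity on normalized embeddings is the natural choice here because it is exactly the quantity CLIP's contrastive objective optimizes; the threshold then acts as a quality gate applied before any mixing of data sources.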
