Extending the WILDS Benchmark for Unsupervised Adaptation

Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data. However, existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the Wilds 2.0 update, which extends 8 of the 10 datasets in the Wilds benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. To maintain consistency, the labeled training, validation, and test sets, as well as the evaluation metrics, are exactly the same as in the original Wilds benchmark. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). We systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on Wilds is limited. To facilitate method development and evaluation, we provide an open-source package that automates data loading and contains all of the model architectures and methods used in this paper. Code and leaderboards are available at https://wilds.stanford.edu.

∗ These authors contributed equally to this work.

arXiv:2112.05090v1 [cs.LG] 9 Dec 2021
