The MultiBERTs: BERT Reproductions for Robustness Analysis

Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this need, we introduce MultiBERTs: a set of 25 BERT-Base checkpoints, trained with hyperparameters similar to those of the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes the 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, to support research on learning dynamics, we also release 28 intermediate checkpoints for each of five of these runs.
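To illustrate the kind of seed-level analysis the release is meant to support, the sketch below implements a simple paired bootstrap over pretraining seeds in Python. It is an assumption-laden stand-in rather than the released library: the function name paired_seed_bootstrap, the input arrays, and the example scores are all illustrative, and the released code may differ in both interface and statistical details.

# Minimal sketch (not the official MultiBERTs library) of a seed-level paired
# bootstrap for comparing two pretraining procedures. Assumes scores_a and
# scores_b hold one fine-tuning metric per pretraining seed, paired by index;
# all names and numbers below are illustrative.
import numpy as np

def paired_seed_bootstrap(scores_a, scores_b, n_boot=10_000, seed=0):
    """Bootstrap the mean difference (A minus B) by resampling seeds."""
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample seed indices with replacement
        diffs[i] = scores_a[idx].mean() - scores_b[idx].mean()
    observed = scores_a.mean() - scores_b.mean()
    # Two-sided bootstrap p-value: how often the resampled difference crosses zero.
    p_value = min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))
    return observed, np.percentile(diffs, [2.5, 97.5]), p_value

# Example with made-up accuracies for five seeds of each procedure.
obs, ci, p = paired_seed_bootstrap([0.842, 0.851, 0.848, 0.839, 0.845],
                                   [0.836, 0.844, 0.841, 0.838, 0.840])
print(f"mean diff={obs:.4f}, 95% CI={ci}, p={p:.3f}")

Resampling at the level of pretraining seeds, rather than individual test examples, is what lets the comparison speak to the procedure rather than to any single checkpoint.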
