Calibration of Pre-trained Transformers

Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated. Specifically, do these models' posterior probabilities provide an accurate empirical measure of how likely the model is to be correct on a given example? In this work, we focus on BERT and RoBERTa and analyze their calibration across three tasks: natural language inference, paraphrase detection, and commonsense reasoning. For each task, we consider in-domain as well as challenging out-of-domain settings, where models face more examples they should be uncertain about. We show that (1) when used out-of-the-box, pre-trained models are calibrated in-domain, and compared to baselines, their out-of-domain calibration error can be as much as 3.5x lower; and (2) temperature scaling is effective at further reducing calibration error in-domain, while using label smoothing to deliberately increase empirical uncertainty helps calibrate posteriors out-of-domain.
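
To make the calibration metric and the temperature-scaling remedy concrete, below is a minimal NumPy sketch, not the authors' implementation: it computes expected calibration error (ECE) by binning predictions by confidence, and fits a single temperature on held-out logits by grid search over negative log-likelihood. The function names, the 10-bin default, and the candidate temperature grid are illustrative assumptions.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        # Bin predictions by confidence; ECE is the bin-size-weighted average
        # gap between each bin's accuracy and its mean confidence.
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                ece += in_bin.mean() * gap
        return ece

    def softmax(logits, temperature=1.0):
        # Temperature scaling divides the logits by a scalar T before the softmax.
        z = logits / temperature
        z = z - z.max(axis=1, keepdims=True)
        exp_z = np.exp(z)
        return exp_z / exp_z.sum(axis=1, keepdims=True)

    def fit_temperature(dev_logits, dev_labels):
        # Post-hoc temperature scaling: grid-search the single scalar T that
        # minimizes negative log-likelihood on held-out (development) data.
        best_t, best_nll = 1.0, np.inf
        for t in np.linspace(0.5, 5.0, 91):  # illustrative candidate grid
            probs = softmax(dev_logits, t)
            nll = -np.log(probs[np.arange(len(dev_labels)), dev_labels] + 1e-12).mean()
            if nll < best_nll:
                best_t, best_nll = t, nll
        return best_t

Because temperature scaling rescales all logits by one scalar learned on development data, it leaves the argmax predictions (and hence accuracy) unchanged and only adjusts how peaked the posterior is; label smoothing, the out-of-domain remedy mentioned above, instead moves the training targets toward a mixture of the gold label and the uniform distribution, so it is applied during fine-tuning rather than post hoc.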
