Is this model reliable for everyone? Testing for strong calibration

In a strongly calibrated risk prediction model, the average predicted probability is close to the true event rate within any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, auditing a model for strong calibration is notoriously difficult, particularly for machine learning (ML) algorithms, because of the sheer number of potential subgroups. Common practice is therefore to assess calibration only with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or a small poorly calibrated subgroup, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if observations can be reordered by their expected residuals, then a poorly calibrated subgroup will manifest as a change in the association between the predicted and observed residuals along this sequence. This reframes calibration testing as a changepoint detection problem, for which powerful methods already exist. We begin by introducing a sample-splitting procedure in which a portion of the data is used to train a suite of candidate models for predicting the residuals, and the remaining data are used to perform a score-based cumulative sum (CUSUM) test. To further improve power, we then extend this adaptive CUSUM test to incorporate cross-validation, while maintaining Type I error control under minimal assumptions. Compared to existing methods, the proposed procedure consistently achieved higher power in simulation studies and more than doubled the power when auditing a mortality risk prediction model.
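To make the reordering-plus-changepoint idea concrete, the sketch below illustrates the sample-splitting scheme described in the abstract. It is a minimal illustration under simplifying assumptions, not the authors' exact procedure: it fits a single residual model (a gradient-boosted regressor) rather than a suite of candidates, uses a simple standardized CUSUM of residuals rather than the paper's score-based statistic, and calibrates the test by permutation instead of the paper's analytic Type I error control. The function name `adaptive_cusum_test` and all helpers are hypothetical.

```python
# Sketch: train a residual model on half the data, order the held-out half by
# the predicted residual, and scan for a changepoint in the observed residuals.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def _max_cusum(r):
    """Maximum absolute standardized partial sum of centered residuals."""
    r = r - r.mean()
    s = np.cumsum(r) / (r.std() * np.sqrt(len(r)))
    return np.max(np.abs(s))


def adaptive_cusum_test(X, y, p_hat, n_perm=500, seed=0):
    """X: covariates, y: binary outcomes, p_hat: model-predicted probabilities."""
    rng = np.random.default_rng(seed)
    n = len(y)
    resid = y - p_hat                                   # observed residuals
    idx = rng.permutation(n)
    train, test = idx[: n // 2], idx[n // 2:]

    # Stage 1: learn where residuals are expected to be large.
    resid_model = GradientBoostingRegressor().fit(X[train], resid[train])
    ordering = test[np.argsort(-resid_model.predict(X[test]))]

    # Stage 2: CUSUM scan of observed residuals along the learned ordering.
    r = resid[ordering]
    stat = _max_cusum(r)

    # Permutation null: reshuffling the ordering destroys any changepoint.
    null = np.array([_max_cusum(rng.permutation(r)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= stat)) / (n_perm + 1)
    return stat, p_value
```

A small p-value indicates that residuals are systematically larger early in the learned ordering than later, i.e. evidence of a poorly calibrated subgroup; the cross-validated extension in the paper reuses all of the data in both stages to improve power.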
