Model-based metrics: Sample-efficient estimates of predictive model subpopulation performance

Machine learning models — now commonly developed to screen, diagnose, or predict health conditions — are evaluated with a variety of performance metrics. An important first step in assessing the practical utility of a model is to evaluate its average performance over a population of interest. In many settings, it is also critical that the model makes good predictions within predefined subpopulations. For instance, showing that a model is fair or equitable requires evaluating the model’s performance in different demographic subgroups. However, subpopulation performance metrics are typically computed using only data from that subgroup, resulting in higher variance estimates for smaller groups. We devise a procedure to measure subpopulation performance that can be more sample-efficient than the typical estimator. We propose using an evaluation model — a model that describes the conditional distribution of the predictive model score — to form model-based metric (MBM) estimates. Our procedure incorporates model checking and validation, and we propose a computationally efficient approximation of the traditional nonparametric bootstrap to form confidence intervals. We evaluate MBMs on two tasks: a semi-synthetic setting where ground truth metrics are available and a real-world hospital readmission prediction task. We find that MBMs consistently produce more accurate and lower variance estimates of model performance, particularly for small subpopulations.
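
To make the idea concrete, below is a minimal, purely illustrative sketch of a model-based metric estimate; it is not the paper's actual evaluation model or procedure. The sketch assumes a simple per-class Beta evaluation model for the predictive model's scores within one subgroup, and estimates AUC by Monte Carlo from that fitted model rather than from the raw (possibly tiny) subgroup sample. The function names (`fit_beta`, `mbm_auc`), the Beta family, and the synthetic data are all assumptions made for illustration.

```python
# Hypothetical sketch of a model-based metric (MBM) style estimate.
# Assumptions (not from the paper): a Beta evaluation model per outcome class,
# method-of-moments fitting, and AUC as the target metric.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def fit_beta(scores):
    """Method-of-moments fit of a Beta distribution to scores in (0, 1)."""
    m, v = scores.mean(), scores.var()
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common  # (alpha, beta)

def mbm_auc(scores, labels, n_draws=100_000):
    """Model-based AUC for one subgroup: fit an evaluation model to the
    scores of each outcome class, then estimate P(score_pos > score_neg)
    by Monte Carlo from the fitted model instead of the raw sample."""
    a1, b1 = fit_beta(scores[labels == 1])
    a0, b0 = fit_beta(scores[labels == 0])
    pos = stats.beta.rvs(a1, b1, size=n_draws, random_state=rng)
    neg = stats.beta.rvs(a0, b0, size=n_draws, random_state=rng)
    return np.mean(pos > neg) + 0.5 * np.mean(pos == neg)

# Tiny synthetic subgroup: the plug-in empirical AUC is high-variance here,
# while the model-based estimate borrows strength from the parametric fit.
labels = rng.integers(0, 2, size=40)
scores = np.clip(rng.beta(2 + 3 * labels, 4 - 2 * labels), 1e-6, 1 - 1e-6)
print("model-based AUC estimate:", round(mbm_auc(scores, labels), 3))
```

In a small subgroup the empirical estimate is noisy, and drawing from a fitted evaluation model trades a little bias for a large reduction in variance; whether that trade is acceptable is exactly what the model checking, validation, and bootstrap-based confidence intervals described in the abstract are meant to assess.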
