Learning how to approve updates to machine learning algorithms in non-stationary settings

Machine learning algorithms in healthcare have the potential to continually learn from real-world data generated during healthcare delivery and adapt to dataset shifts. As such, the FDA is looking to design policies that can autonomously approve modifications to machine learning algorithms while maintaining or improving the safety and effectiveness of the deployed models. However, selecting a fixed approval strategy a priori can be difficult because its performance depends on the stationarity of the data and the quality of the proposed modifications. To this end, we investigate a learning-to-approve approach (L2A) that uses accumulating monitoring data to learn how to approve modifications. L2A defines a family of strategies that vary in their "optimism", where more optimistic policies have faster approval rates, and searches over this family using an exponentially weighted average forecaster. To control the cumulative risk of the deployed model, we give L2A the option to abstain from making a prediction and incur a fixed abstention cost instead. We derive bounds on the average risk of the model deployed by L2A, assuming the distributional shifts are smooth. In simulation studies and empirical analyses, L2A tailors its level of optimism to each problem setting: it learns to abstain when performance drops are common and to approve beneficial modifications quickly when the distribution is stable.
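As a rough illustration of the learning component described above, the sketch below runs a standard exponentially weighted average (multiplicative-weights) forecaster over a small family of candidate approval policies, with each round's loss capped at a fixed abstention cost. Everything concrete here (the policy family size, learning rate, abstention cost, and the simulated per-policy risks) is an illustrative assumption, not the paper's actual formulation.

```python
import numpy as np

# Minimal sketch, assuming: a fixed family of approval policies ordered by
# "optimism", per-round monitoring losses in [0, 1], and an abstention option
# that replaces a prediction's loss with a fixed cost. The values below are
# illustrative, not taken from the paper.

rng = np.random.default_rng(0)

n_policies = 5      # candidate approval strategies, cautious to optimistic
eta = 0.5           # learning rate of the forecaster
abstain_cost = 0.3  # fixed loss incurred whenever the deployed model abstains
T = 100             # number of monitoring rounds

log_weights = np.zeros(n_policies)  # uniform prior over the policy family

for t in range(T):
    # Stand-in for the risk each policy's deployed model incurs this round,
    # estimated from accumulating monitoring data.
    risks = rng.uniform(size=n_policies)

    # Abstaining caps each policy's effective loss at the abstention cost.
    losses = np.minimum(risks, abstain_cost)

    # Exponentially weighted average update: down-weight policies in
    # proportion to the loss they just incurred.
    log_weights -= eta * losses

# Final mixture over the family (softmax, computed stably in log space).
probs = np.exp(log_weights - log_weights.max())
probs /= probs.sum()
print("final policy weights:", np.round(probs, 3))
```

Under this update, policies that repeatedly incur high monitoring losses lose weight exponentially fast, which is what lets the forecaster shift toward more cautious or more optimistic members of the family as the environment changes.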
