Active learning for efficiently training emulators of computationally expensive mathematical models

An emulator is a fast-to-evaluate statistical approximation of a detailed mathematical model (simulator). When used in lieu of simulators, emulators can expedite tasks that require many repeated evaluations, such as sensitivity analyses, policy optimization, model calibration, and value-of-information analyses. Emulators are developed using the output of simulators at specific input values (design points). Developing an emulator that closely approximates the simulator can require many design points, and evaluating the simulator at each of them becomes computationally expensive. We describe a self-terminating active learning algorithm to efficiently develop emulators tailored to a specific emulation task, and compare it with algorithms that optimize geometric criteria (random Latin hypercube sampling and maximum projection designs) and with other active learning algorithms (treed Gaussian processes that optimize typical active learning criteria). We compared the algorithms' root mean square error (RMSE) and maximum absolute deviation from the simulator (MAX) on seven benchmark functions and on a prostate cancer screening model. In the empirical analyses, for simulators whose smoothness varies greatly over the input domain, active learning algorithms yielded emulators with smaller RMSE and MAX for the same number of design points. In all other cases, all algorithms performed comparably. The proposed algorithm attained satisfactory performance in all analyses, had smaller variability than the treed Gaussian processes, and, on average, performed similarly to or better than the treed Gaussian processes in six of the seven benchmark functions and in the prostate cancer model.
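The two accuracy measures named in the abstract, RMSE and MAX, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy `simulator`, the cubic polynomial standing in for an emulator, the six evenly spaced design points, and the test grid are all hypothetical, and the paper's own emulators and designs are different.

```python
import numpy as np


def rmse(sim_out, emu_out):
    """Root mean square error between simulator and emulator outputs."""
    diff = np.asarray(emu_out, dtype=float) - np.asarray(sim_out, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def max_abs_dev(sim_out, emu_out):
    """Maximum absolute deviation of the emulator from the simulator (MAX)."""
    diff = np.asarray(emu_out, dtype=float) - np.asarray(sim_out, dtype=float)
    return float(np.max(np.abs(diff)))


def simulator(x):
    """Hypothetical stand-in for an expensive simulator (assumption)."""
    return np.sin(3.0 * x)


# Design points: inputs at which the "expensive" simulator is actually run.
design = np.linspace(0.0, 1.0, 6)

# Stand-in emulator: a cubic polynomial fitted to the simulator output
# at the design points (an assumption, not the paper's emulator).
coeffs = np.polyfit(design, simulator(design), deg=3)


def emulator(x):
    return np.polyval(coeffs, x)


# Compare emulator and simulator on a dense grid of held-out inputs.
test_x = np.linspace(0.0, 1.0, 201)
err_rmse = rmse(simulator(test_x), emulator(test_x))
err_max = max_abs_dev(simulator(test_x), emulator(test_x))
print(f"RMSE = {err_rmse:.4g}, MAX = {err_max:.4g}")
```

RMSE summarizes average error over the input domain, while MAX reports the single worst discrepancy, so RMSE can never exceed MAX on the same test set.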
