Linking big models to big data: efficient ecosystem model calibration through Bayesian model emulation

Abstract. Data-model integration plays a critical role in assessing and improving our capacity to predict ecosystem dynamics. Similarly, the ability to attach quantitative statements of uncertainty to model forecasts is crucial for model assessment and interpretation and for setting field research priorities. Bayesian methods provide a rigorous data assimilation framework for these applications, especially for problems with multiple data constraints. However, the Markov chain Monte Carlo (MCMC) techniques underlying most Bayesian calibration can be prohibitive for computationally demanding models and large datasets. We employ an alternative method, Bayesian model emulation of sufficient statistics, that can approximate the full joint posterior density, is more amenable to parallelization, and provides an estimate of parameter sensitivity. The analysis used informative priors constructed from a meta-analysis of the primary literature, specified both model and data uncertainties, and introduced novel approaches to autocorrelation corrections on multiple data streams and to emulating the sufficient statistics surface. We report the integration of this method within the ecological workflow management software Predictive Ecosystem Analyzer (PEcAn), and its application and validation with two process-based terrestrial ecosystem models: SIPNET and ED2. In a test against a synthetic dataset, the emulator was able to retrieve the true parameter values. A comparison of the emulator approach with standard brute-force MCMC involving multiple data constraints showed that, for the faster and simpler SIPNET model, the emulator constrained the parameters about as well as the brute-force approach while reducing computation time by more than two orders of magnitude. The emulator was then applied to calibration of the ED2 model, whose complexity precludes standard (brute-force) Bayesian data assimilation techniques. Both models were constrained after assimilation of the observational data with the emulator method, reducing the uncertainty around their predictions. Performance metrics showed increased agreement between model predictions and data. Our study furthers efforts toward reducing model uncertainties, showing that the emulator method makes it possible to efficiently calibrate complex models.
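The emulation idea summarized in the abstract can be illustrated with a minimal sketch. It is not the PEcAn implementation: the one-parameter `simulator`, the data values, the known `sigma`, the flat prior on [0, 4], the fixed kernel length scale, and all function names are assumptions made for illustration. The expensive model is run only at a few design points, the sufficient statistic (here a sum of squared residuals) is interpolated with a Gaussian process, and a Metropolis sampler then queries the cheap emulator instead of the model.

```python
import math
import random

def simulator(theta, x):
    """Hypothetical stand-in for an expensive ecosystem model run."""
    return [theta * xi for xi in x]

def sum_sq(theta, x, y):
    """Sufficient statistic of a Gaussian likelihood with known sigma."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, simulator(theta, x)))

def rbf(a, b, length=1.0):
    """Squared-exponential kernel; the length scale is fixed by assumption."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Gaussian elimination with partial pivoting (keeps the sketch stdlib-only)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    out = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * out[k] for k in range(r + 1, n))
        out[r] = (M[r][n] - s) / M[r][r]
    return out

def fit_emulator(knots, values, nugget=1e-8):
    """Return the GP posterior-mean predictor of the statistic surface."""
    mean = sum(values) / len(values)
    K = [[rbf(a, b) + (nugget if i == j else 0.0)
          for j, b in enumerate(knots)] for i, a in enumerate(knots)]
    alpha = solve(K, [v - mean for v in values])
    def predict(theta):
        return mean + sum(w * rbf(theta, k) for w, k in zip(alpha, knots))
    return predict

def metropolis(log_post, start, n_iter, step, rng):
    """Random-walk Metropolis sampler that only evaluates the emulator."""
    theta, lp = start, log_post(start)
    chain = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        lp_prop = log_post(prop)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain.append(theta)
    return chain

# Illustrative data, generated by hand around a "true" theta of 2.
x = [0.5, 1.0, 1.5, 2.0]
y = [1.1, 1.8, 3.15, 3.95]
sigma = 0.5               # observation error, assumed known here
lo, hi = 0.0, 4.0         # bounds of an assumed flat prior

# Only these nine design points require a run of the "expensive" model.
knots = [lo + i * (hi - lo) / 8 for i in range(9)]
emulator = fit_emulator(knots, [sum_sq(t, x, y) for t in knots])

def log_post(theta):
    if not lo <= theta <= hi:
        return float("-inf")
    return -emulator(theta) / (2.0 * sigma ** 2)

chain = metropolis(log_post, 1.0, 6000, 0.3, random.Random(1))
posterior = chain[1000:]  # discard burn-in
post_mean = sum(posterior) / len(posterior)
print(round(post_mean, 2))
```

In this sketch the design-point runs are independent of one another, which is what makes the approach easy to parallelize; the sampler itself never calls `simulator`.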
