An Ensemble-Based Statistical Methodology to Detect Differences in Weather and Climate Model Executables

Abstract. Since their first operational application in the 1950s, atmospheric numerical models have become essential tools in weather and climate prediction. As such, they are subject to continuous change, driven by advances in computer systems, numerical methods, more and better observations, and the ever-increasing knowledge about Earth's atmosphere. Many of the changes in today's models are seemingly innocuous modifications: minor code rearrangements, changes in hardware infrastructure, or software updates. Such changes are not supposed to affect the model significantly. However, this is difficult to verify, because the atmosphere is a chaotic system in which even a tiny change can have a large impact on individual simulations. Overall, this poses a serious challenge to consistent model development and maintenance. Here we propose a new methodology for quantifying and verifying the impact of minor changes to an atmospheric model or to its underlying hardware/software system, using an ensemble of simulations with slightly perturbed initial conditions in combination with a statistical hypothesis test. The methodology can assess the effect of model changes on almost any output variable over time, and it can be used with different underlying statistical hypothesis tests. We present first applications of the methodology with a regional weather and climate model, including the verification of a major system update of the underlying supercomputer. The methodology produces very robust results while remaining sensitive even to tiny changes. Our results show that changes are often detectable only during the first hours of a simulation, which suggests that short-term simulations (days to months) are best suited for the methodology, even when addressing long-term climate simulations. We also show that the choice of the underlying statistical hypothesis test is not critical and that the methodology already works well at coarse resolutions, making it computationally inexpensive and therefore an ideal candidate for automated testing.
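To make the approach concrete, below is a minimal Python sketch of one plausible building block: comparing a reference ensemble against a test ensemble for a single output variable at a single output time, using a per-grid-cell two-sided Mann-Whitney U test combined with Benjamini-Hochberg false-discovery-rate control across grid cells. The abstract does not specify the exact test or multiple-testing procedure, so this choice, along with all function names and array shapes, is an illustrative assumption rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def field_rejection_rate(ref, test, alpha=0.05):
    """Fraction of grid cells where the two ensembles differ significantly.

    ref, test: arrays of shape (n_members, n_cells) holding one output
    variable (e.g. 2 m temperature) at one lead time, one column per
    grid cell. Names and shapes are illustrative assumptions.
    """
    m = ref.shape[1]
    # Per-cell two-sided Mann-Whitney U test: a rank-based test that
    # avoids assuming normally distributed model output.
    pvals = np.array([
        mannwhitneyu(ref[:, i], test[:, i], alternative="two-sided").pvalue
        for i in range(m)
    ])
    # Benjamini-Hochberg procedure to control the false discovery rate
    # across the m simultaneous per-cell tests: reject the k smallest
    # p-values, where k is the largest index with p_(k) <= k/m * alpha.
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject.mean()


# Toy usage: two 20-member ensembles on a flattened 10x10 grid. The
# test ensemble is deliberately shifted to mimic a real model change.
rng = np.random.default_rng(0)
ref = rng.normal(size=(20, 100))
test = rng.normal(loc=0.5, size=(20, 100))
print(f"rejection rate: {field_rejection_rate(ref, test):.2f}")
```

A full consistency test would repeat this comparison for each output variable and lead time, and flag a model difference only when the rejection rate exceeds the level expected under the null hypothesis, which can be calibrated from reference-versus-reference ensemble pairs of the unmodified model.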
