Significance of changes in medium-range forecast scores

The impact of developments in weather forecasting is measured using forecast verification, but many developments, though useful, have impacts of less than 0.5 % on medium-range forecast scores. Chaotic variability in the quality of individual forecasts is so large that it can be hard to achieve statistical significance when comparing these ‘smaller’ developments to a control. For example, with 60 separate forecasts and a 95 % confidence level, a change in the quality of the day-5 forecast needs to be larger than 1 % to be statistically significant using a Student's t-test. The first aim of this study is to illustrate the importance of significance testing in forecast verification and to highlight the surprisingly large sample sizes required to attain significance. The second aim is to assess how reliable current approaches to significance testing are, following the suspicion that apparently significant results may actually have been generated by chaotic variability. An independent realisation of the null hypothesis can be created by running a forecast experiment containing a purely numerical perturbation and comparing it to a control. With 1885 paired differences from about 2.5 yr of testing, an alternative significance test can be constructed that makes no statistical assumptions about the data. This is used to test experimentally the validity of the normal statistical framework for forecast scores, and it shows that naive application of Student's t-test generates too many false positives (i.e. false rejections of the null hypothesis). A known issue is temporal autocorrelation in forecast scores, which can be corrected by inflating the confidence range, but typical inflation factors, such as those based on an AR(1) model, are not large enough and are themselves affected by sampling uncertainty. Further, the importance of statistical multiplicity has not been widely appreciated, and it becomes particularly dangerous when many experiments are compared together: across three forecast experiments, for example, there could be roughly a 1 in 2 chance of getting a false positive. However, when correctly adjusted for autocorrelation, and when the effects of multiplicity are properly treated using a Šidák correction, the t-test is a reliable way of establishing the significance of changes in forecast scores.
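
As an illustration of the kind of test described above, the sketch below applies a paired Student's t-test to experiment-minus-control score differences and widens the confidence range using the standard AR(1) effective-sample-size adjustment. The data and variable names here are synthetic placeholders, not the study's verification output; note also that the study finds this AR(1) inflation alone is typically not large enough.

```python
import numpy as np
from scipy import stats

# Synthetic example: paired day-5 score differences (experiment minus control)
# for n consecutive forecasts. In practice d would come from verification output.
rng = np.random.default_rng(0)
n = 60
d = rng.normal(loc=0.0, scale=0.02, size=n)

# Naive paired t-test: assumes independent, normally distributed differences.
t_stat, p_naive = stats.ttest_1samp(d, popmean=0.0)

# Lag-1 autocorrelation of the differences, used in the AR(1) adjustment.
r1 = np.corrcoef(d[:-1], d[1:])[0, 1]

# AR(1) effective sample size: serial correlation reduces the number of
# independent samples, which widens the confidence interval for the mean.
n_eff = n * (1.0 - r1) / (1.0 + r1)
se = d.std(ddof=1) / np.sqrt(n_eff)
t_crit = stats.t.ppf(0.975, df=n_eff - 1)
ci_low, ci_high = d.mean() - t_crit * se, d.mean() + t_crit * se

print(f"mean difference          = {d.mean():+.4f}")
print(f"naive t-test p-value     = {p_naive:.3f}")
print(f"lag-1 autocorrelation    = {r1:+.3f}  (n_eff = {n_eff:.1f})")
print(f"95% CI, AR(1)-inflated   = [{ci_low:+.4f}, {ci_high:+.4f}]")
```

If the AR(1)-inflated confidence interval excludes zero, the change in score would be judged significant at the 95 % level; the study's point is that without the inflation (and without accounting for multiplicity) such a judgement is too often a false positive.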

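The multiplicity point can be made concrete with the Šidák correction mentioned in the abstract. The snippet below is a minimal sketch, not the paper's exact accounting: it shows how the family-wise false-positive rate grows with the number of independent comparisons m, and how the Šidák-adjusted per-test level restores the intended overall rate. The value m = 14 is purely illustrative, chosen because that many independent 5 % tests give roughly a 1 in 2 chance of at least one false positive.

```python
# Sidak correction for multiple comparisons (minimal sketch).
alpha = 0.05  # desired family-wise false-positive rate

for m in (3, 14):  # m = number of independent comparisons (illustrative values)
    family_wise = 1.0 - (1.0 - alpha) ** m          # chance of >= 1 false positive
    alpha_sidak = 1.0 - (1.0 - alpha) ** (1.0 / m)  # per-test level restoring alpha
    print(f"m = {m:2d}: naive family-wise rate = {family_wise:.2f}, "
          f"Sidak per-test level = {alpha_sidak:.4f}")
```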