Understanding forecast verification statistics

Although there are numerous reasons for performing a verification analysis, there are usually two general questions that are of interest: are the forecasts good, and can we be confident that the estimate of forecast quality is not misleading? When calculating a verification score, it is not usually obvious how the score can answer either of these questions. Some procedures for attempting to answer the questions are reviewed, with particular focus on p-values and confidence intervals. P-values are shown to be rather unhelpful in answering either question, especially when applied to probabilistic verification scores, and confidence intervals are to be preferred. However, confidence intervals cannot reveal biases in the value of a score that arise from an inadequate experimental design for testing on truly out-of-sample observations. Some specific problems with cross-validation are highlighted. Finally, in the interests of increasing the insight into forecast strengths and weaknesses and in pointing towards methods for improving forecast quality, a plea is made for a more discriminating selection of verification procedures than has been adopted to date. Copyright © 2008 Royal Meteorological Society
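
To make the preference for confidence intervals concrete, the following is a minimal sketch of a percentile bootstrap interval for the Brier score of probabilistic forecasts of a binary event. It is an illustration under stated assumptions, not the procedure used in the paper: the function names, the choice of the percentile bootstrap, and the data are all hypothetical, and the resampling assumes that forecast-observation pairs are independent.

```python
import random


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def bootstrap_ci(probs, outcomes, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the Brier score.

    Assumption: forecast-observation pairs are independent. Serially
    correlated data would call for a block bootstrap instead.
    """
    rng = random.Random(seed)
    n = len(probs)
    scores = []
    for _ in range(n_boot):
        # Resample forecast-observation pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(
            brier_score([probs[i] for i in idx], [outcomes[i] for i in idx])
        )
    scores.sort()
    lower = scores[int((1 - level) / 2 * n_boot)]
    upper = scores[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Made-up data for illustration only.
probs = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.4, 0.5, 0.85]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]

score = brier_score(probs, outcomes)
lower, upper = bootstrap_ci(probs, outcomes)
print(f"Brier score = {score:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

An interval of this kind describes sampling uncertainty only. As the abstract notes, it cannot reveal a bias that comes from scoring forecasts on data that are not truly out-of-sample.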
