Valid sequential inference on probability forecast performance

Probability forecasts for binary events play a central role in many applications. Their quality is commonly assessed with proper scoring rules, which assign each forecast a numerical score such that the true event probability attains the minimal expected score. In this paper, we construct e-values for testing the statistical significance of score differences between competing forecasts in sequential settings. E-values have been proposed as an alternative to p-values for hypothesis testing, and they can easily be transformed into conservative p-values by taking the multiplicative inverse. The e-values proposed in this article are valid in finite samples, without any assumptions on the data-generating processes. They also allow optional stopping: a forecast user may stop the evaluation at any time, in light of the data observed so far, and still draw statistically valid inferences, which is generally not true for classical p-value-based tests. In a case study on postprocessing of precipitation forecasts, state-of-the-art forecast dominance tests and e-values lead to the same conclusions.
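As an illustration of how such e-values can be used in practice, the following minimal Python sketch compares two simulated probability forecasts under the Brier score. This is a sketch under stated assumptions, not the paper's exact construction: the simulated forecasts, the fixed bet size lam, and all variable names are hypothetical choices for this example. The key point is that, under the null hypothesis that the competitor forecasts at least as well, the running product of factors 1 + lam * d_t is a nonnegative supermartingale starting at 1, so it is an anytime-valid e-process; its reciprocal, capped at 1, is a conservative p-value, and the evaluation may be stopped at any time.

import numpy as np

rng = np.random.default_rng(42)

def brier(p, y):
    """Brier score: lower is better, values in [0, 1]."""
    return (p - y) ** 2

# Simulated binary outcomes and two competing probability forecasts.
# Forecast q is deliberately biased, so forecast p should win (illustrative setup).
n = 1000
truth = rng.uniform(0.2, 0.8, size=n)    # latent event probabilities
y = rng.binomial(1, truth)               # observed binary outcomes
p = truth                                # well-calibrated forecaster
q = np.clip(truth + 0.15, 0.0, 1.0)      # biased competitor

# Score differences d_t = S(q_t, y_t) - S(p_t, y_t).  Under the null
# "q is at least as good as p", E[d_t | past] <= 0, and d_t lies in [-1, 1].
d = brier(q, y) - brier(p, y)

# Betting e-process: E_t = prod_{i<=t} (1 + lam * d_i).  For lam in [0, 1]
# each factor is nonnegative, and under the null the running product is a
# supermartingale with initial value 1, hence an anytime-valid e-value.
lam = 0.5                                # fixed bet size (simplest valid choice)
e_process = np.cumprod(1 + lam * d)

# Anytime-valid decisions: by Ville's inequality we may stop at any t and
# reject at level alpha once E_t >= 1/alpha; min(1, 1/E_t) is a conservative p-value.
alpha = 0.05
crossed = e_process >= 1 / alpha
t_reject = int(np.argmax(crossed)) + 1 if crossed.any() else None
p_value = min(1.0, 1.0 / e_process[-1])
print(f"final e-value: {e_process[-1]:.2f}, conservative p-value: {p_value:.4f}")
if t_reject is not None:
    print(f"could have stopped and rejected the null at time t = {t_reject}")

A fixed bet size is only the simplest valid choice here; betting strategies that adapt lam to past score differences can improve power without affecting validity, since validity rests solely on the supermartingale property under the null.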
