Comparing Sequential Forecasters

Consider two or more forecasters, each making a sequence of predictions for different events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions about how the forecasts or outcomes were generated? This work presents a novel and rigorous answer to this question. We design a sequential inference procedure for estimating the time-varying difference in forecast quality as measured by any scoring rule. The resulting confidence intervals are nonasymptotically valid and can be continuously monitored to yield statistically valid comparisons at arbitrary data-dependent stopping times (“anytime-valid”); this is enabled by adapting variance-adaptive supermartingales, confidence sequences, and e-processes to our setting. Motivated by Shafer and Vovk’s game-theoretic probability, our coverage guarantees are also distribution-free, in the sense that they make no distributional assumptions on the forecasts or outcomes. In contrast to recent work by Henzi and Ziegel, our tools can sequentially test a weak null hypothesis about whether one forecaster outperforms another on average over time. We demonstrate their effectiveness by comparing probability forecasts on Major League Baseball (MLB) games and statistical postprocessing methods for ensemble weather forecasts.
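
To make the idea of an anytime-valid comparison concrete, the following is a minimal sketch and not the procedure developed in this work. It swaps the variance-adaptive construction for a much simpler fixed-parameter Hoeffding-style confidence sequence obtained from Ville's inequality, and it assumes binary outcomes scored with the Brier score, so that each score difference lies in [-1, 1]. The function names `brier` and `hoeffding_cs`, the tuning parameter `lam`, and the simulated data are all illustrative assumptions.

```python
import math
import random


def brier(p, y):
    """Brier score of a probability forecast p for a binary outcome y (lower is better)."""
    return (p - y) ** 2


def hoeffding_cs(p_a, p_b, outcomes, alpha=0.05, lam=0.1):
    """Return (estimate, lower, upper) for each time t = 1, 2, ...

    The target at time t is the running average of the conditional means of
    delta_i = brier(p_b[i], y_i) - brier(p_a[i], y_i), so positive values
    favor forecaster A. Each delta_i lies in [-1, 1].
    """
    if not (0 < alpha < 1 and lam > 0):
        raise ValueError("need 0 < alpha < 1 and lam > 0")
    results, total = [], 0.0
    for t, (a, b, y) in enumerate(zip(p_a, p_b, outcomes), start=1):
        total += brier(b, y) - brier(a, y)
        # Hoeffding's lemma (range width 2) makes exp(lam * S_t - t * lam**2 / 2)
        # a nonnegative supermartingale, where S_t is the centered running sum.
        # Ville's inequality, applied to each tail at level alpha / 2, then
        # gives this radius simultaneously for all t.
        radius = lam / 2 + math.log(2 / alpha) / (lam * t)
        estimate = total / t
        results.append((estimate, estimate - radius, estimate + radius))
    return results


if __name__ == "__main__":
    rng = random.Random(0)
    q = 0.8  # hypothetical true event probability, used only for simulation
    ys = [1 if rng.random() < q else 0 for _ in range(5000)]
    # Forecaster A reports q; forecaster B always reports 0.5.
    cs = hoeffding_cs([q] * len(ys), [0.5] * len(ys), ys, lam=0.05)
    for t in (100, 1000, 5000):
        est, lo, hi = cs[t - 1]
        print(f"t={t}: estimate={est:.3f}, interval=({lo:.3f}, {hi:.3f})")
```

Because the bound holds simultaneously over all times, the intervals can be inspected after every new outcome and the comparison stopped at any data-dependent time. With a fixed `lam`, the radius never shrinks below `lam / 2`; variance-adaptive constructions of the kind referred to above are designed to be tighter.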

[1]  M. Schervish A General Method for Comparing Probability Assessors , 1989 .

[2]  A. O'Hagan,et al.  Statistical Methods for Eliciting Probability Distributions , 2005 .

[3]  T. Gneiting Making and Evaluating Point Forecasts , 2009, 0912.0902.

[4]  J McCarthy,et al.  MEASURES OF THE VALUE OF INFORMATION. , 1956, Proceedings of the National Academy of Sciences of the United States of America.

[5]  Xiequan Fan,et al.  Exponential inequalities for martingales with applications , 2013, 1311.6273.

[6]  T. Lai Boundary Crossing Probabilities for Sample Sums and Confidence Sequences , 1976 .

[7]  F. J. Anscombe,et al.  Fixed-Sample-Size Analysis of Sequential Observations , 1954 .

[8]  Christopher Jennison,et al.  Interim analyses: the repeated confidence interval approach , 1989 .

[9]  Matthew Malloy,et al.  lil' UCB : An Optimal Exploration Algorithm for Multi-Armed Bandits , 2013, COLT.

[10]  L. J. Savage Elicitation of Personal Probabilities and Expectations , 1971 .

[11]  Vladimir Vovk,et al.  Game‐Theoretic Foundations for Probability and Finance , 2019, Wiley Series in Probability and Statistics.

[12]  Paul Mineiro,et al.  Off-policy Confidence Sequences , 2021, ICML.

[13]  H. Robbins,et al.  Boundary Crossing Probabilities for the Wiener Process and Sample Sums , 1970 .

[14]  R. L. Winkler Evaluating probabilities: asymmetric scoring rules , 1994 .

[15]  Jean-Luc Ville Étude critique de la notion de collectif , 1939 .

[16]  T. Shakespeare,et al.  Observational Studies , 2003 .

[17]  Evgeni Y. Ovcharov,et al.  Proper Scoring Rules and Bregman Divergences , 2015, 1502.01178.

[18]  G. Brier VERIFICATION OF FORECASTS EXPRESSED IN TERMS OF PROBABILITY , 1950 .

[19]  L. Pekelis,et al.  Always Valid Inference: Bringing Sequential Analysis to A/B Testing , 2015, 1512.04922.

[20]  H. Robbins Statistical Methods Related to the Law of the Iterated Logarithm , 1970 .

[21]  A. Dawid,et al.  Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory , 2004, math/0410076.

[22]  H. Robbins,et al.  Confidence sequences for mean, variance, and median. , 1967, Proceedings of the National Academy of Sciences of the United States of America.

[23]  Tilmann Gneiting,et al.  Of quantiles and expectiles: consistent scoring functions, Choquet representations and forecast rankings , 2015, 1503.08195.

[24]  Ute Dreher,et al.  Measure And Integration Theory , 2016 .

[25]  G. A. Young,et al.  High‐dimensional Statistics: A Non‐asymptotic Viewpoint, Martin J. Wainwright, Cambridge University Press, 2019 , 2020, International Statistical Review.

[26]  Marie Schmidt,et al.  Nonparametrics Statistical Methods Based On Ranks , 2016 .

[27]  Jasjeet S. Sekhon,et al.  Time-uniform, nonparametric, nonasymptotic confidence sequences , 2020, The Annals of Statistics.

[28]  A. H. Murphy,et al.  Scoring rules and the evaluation of probabilities , 1996 .

[29]  S. Vannitsem,et al.  Statistical Postprocessing for Weather Forecasts: Review, Challenges, and Avenues in a Big Data World , 2020, Bulletin of the American Meteorological Society.

[30]  Johanna F. Ziegel,et al.  Valid sequential inference on probability forecast performance , 2021, Biometrika.

[31]  Bo Waggoner,et al.  Linear Functions to the Extended Reals , 2021, ArXiv.

[32]  James S. Kennedy,et al.  EVALUATING PROBABILITY FORECASTS. , 1969 .

[33]  A. Raftery,et al.  Probabilistic forecasts, calibration and sharpness , 2007 .

[34]  Ian A. Kash,et al.  General Truthfulness Characterizations Via Convex Analysis , 2012, WINE.

[35]  Wouter M. Koolen,et al.  Admissible anytime-valid sequential inference must rely on nonnegative martingales. , 2020, 2009.03167.

[36]  Jon D. McAuliffe,et al.  Time-uniform Chernoff bounds via nonnegative supermartingales , 2018, Probability Surveys.

[39]  F. Molteni,et al.  The ECMWF Ensemble Prediction System: Methodology and validation , 1996 .

[40]  Halbert White,et al.  Tests of Conditional Predictive Ability , 2003 .

[41]  Stephen E. Fienberg,et al.  The Comparison and Evaluation of Forecasters. , 1983 .

[42]  J. Doob Regularity properties of certain families of chance variables , 1940 .

[43]  W. Hoeffding Probability Inequalities for sums of Bounded Random Variables , 1963 .

[44]  Tilmann Gneiting,et al.  Isotonic distributional regression , 2021, Journal of the Royal Statistical Society: Series B (Statistical Methodology).

[45]  F. Diebold,et al.  Comparing Predictive Accuracy , 1994, Business Cycles.

[46]  T. Lai On Confidence Sequences , 1976 .

[47]  Jacob D. Abernethy,et al.  A Characterization of Scoring Rules for Linear Properties , 2012, COLT.

[48]  Wouter M. Koolen,et al.  Testing exchangeability: Fork-convexity, supermartingales and e-processes , 2021, Int. J. Approx. Reason..

[49]  A. Raftery,et al.  Strictly Proper Scoring Rules, Prediction, and Estimation , 2007 .

[50]  Aaditya Ramdas,et al.  Estimating means of bounded random variables by betting , 2020 .

[51]  Glenn Shafer,et al.  Author's reply to the Discussion of ‘Testing by betting: A strategy for statistical and scientific communication’ by Glenn Shafer , 2021, Journal of the Royal Statistical Society: Series A (Statistics in Society).

[52]  Akimichi Takemura,et al.  Defensive Forecasting , 2005, AISTATS.

[53]  David Arbour,et al.  Doubly robust confidence sequences for sequential causal inference , 2021 .

[54]  W. Ehm,et al.  Forecast dominance testing via sign randomization , 2017, 1707.03035.

[55]  Stefano Ermon,et al.  Adaptive Concentration Inequalities for Sequential Decision Problems , 2016, NIPS.

[56]  Lalit Jain,et al.  A Bandit Approach to Multiple Testing with False Discovery Control , 2018, ArXiv.

[57]  A. Dawid,et al.  Theory and applications of proper scoring rules , 2014, 1401.0398.

[58]  V. Vovk,et al.  E-values: Calibration, combination, and applications , 2019 .

[59]  Achim Zeileis,et al.  Extending Extended Logistic Regression: Extended versus Separate versus Ordered versus Censored , 2014 .