Regions of Reliability in the Evaluation of Multivariate Probabilistic Forecasts

Multivariate probabilistic time series forecasts are commonly evaluated via proper scoring rules, i.e., functions whose expected value is minimized by the ground-truth distribution. However, this property is not sufficient to guarantee good discrimination in the non-asymptotic regime. In this paper, we provide the first systematic finite-sample study of proper scoring rules for time series forecasting evaluation. Through a power analysis, we identify the "region of reliability" of a scoring rule, i.e., the set of practical conditions under which it can be relied on to identify forecasting errors. We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions, and we gauge the generalizability of our findings to real-world tasks with an application to an electricity production problem. Our results reveal critical shortcomings in the evaluation of multivariate probabilistic forecasts as commonly performed in the literature.
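
To make the finite-sample concern concrete, the following is a minimal sketch, not the paper's benchmark or its exact power analysis. It assumes a standard Gaussian ground truth, a forecast whose mean is shifted by a hypothetical `shift` per dimension, and the sample-based energy score as the scoring rule. It estimates how often the correct forecast obtains the better (lower) average score as the number of observations grows. The names `mean_energy_score` and `discrimination_rate`, and all parameter values, are illustrative assumptions.

```python
import numpy as np

def mean_energy_score(samples, ys):
    """Average sample-based energy score over observations (lower is better).

    samples: (m, d) draws from the forecast; ys: (n, d) observations.
    Uses the plain all-pairs estimate of the spread term, which includes
    the zero self-distances.
    """
    # E||X - y||, averaged over forecast draws and observations
    obs_term = np.linalg.norm(ys[:, None, :] - samples[None, :, :], axis=2).mean()
    # E||X - X'||, averaged over all pairs of forecast draws
    spread = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2).mean()
    return obs_term - 0.5 * spread

def discrimination_rate(n_obs, n_trials=200, dim=4, shift=0.5, m=100, seed=0):
    """Fraction of trials in which the correct forecast scores better.

    Assumed setup: ground truth N(0, I); wrong forecast N(shift, I).
    """
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_trials):
        ys = rng.standard_normal((n_obs, dim))        # observed data
        good = rng.standard_normal((m, dim))          # draws from the correct forecast
        bad = rng.standard_normal((m, dim)) + shift   # draws from the shifted forecast
        wins += mean_energy_score(good, ys) < mean_energy_score(bad, ys)
    return wins / n_trials

if __name__ == "__main__":
    for n in (1, 5, 50):
        print(n, discrimination_rate(n))
```

Under these assumptions, the rate should rise toward 1 as `n_obs` increases. A region of reliability in the sense described above would correspond to the combinations of sample size, dimension, and error magnitude for which such a rate (or, more formally, the power of a statistical test) is high enough.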
