Using Stacking to Average Bayesian Predictive Distributions (with Discussion)

The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting, in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize it to the combination of predictive distributions. We extend the utility function to any proper scoring rule, use Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions, and apply regularization for greater stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian bootstrap (BB-pseudo-BMA). Based on simulations and real-data applications, we recommend stacking of predictive distributions, with BB-pseudo-BMA as an approximate alternative when computational cost is an issue.
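To make the contrast between stacking and pseudo-BMA weighting concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the pointwise leave-one-out log predictive densities are already available as an array `lpd` of shape (n, K), for example from Pareto smoothed importance sampling. The function names, the EM-style fixed-point update used to maximize the log score over the simplex, and the toy numbers are all illustrative assumptions. The regularization and the Bayesian bootstrap step mentioned in the abstract are not included.

```python
import numpy as np


def stacking_weights(lpd, n_iter=2000, tol=1e-10):
    """Weights on the simplex maximizing sum_i log(sum_k w_k * p_ik).

    lpd: (n, K) array of log leave-one-out predictive densities
         log p(y_i | y_{-i}, M_k), assumed to be precomputed.
    Uses the EM update for mixture weights with fixed components
    (an illustrative optimizer choice, not taken from the source).
    """
    lpd = np.asarray(lpd, dtype=float)
    n, K = lpd.shape
    # Rescale each row by its maximum; this adds a constant per row to
    # the objective and therefore does not change the optimal weights.
    dens = np.exp(lpd - lpd.max(axis=1, keepdims=True))
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        mix = dens @ w                       # mixture density per point
        w_new = w * (dens / mix[:, None]).mean(axis=0)
        converged = np.abs(w_new - w).max() < tol
        w = w_new
        if converged:
            break
    return w / w.sum()


def pseudo_bma_weights(lpd):
    """AIC-type weights proportional to exp(sum_i lpd_ik)."""
    elpd = np.asarray(lpd, dtype=float).sum(axis=0)
    z = np.exp(elpd - elpd.max())            # stabilized softmax
    return z / z.sum()


# Hypothetical toy input: model 0 predicts three points well and one
# badly, model 1 does the opposite.
toy_lpd = np.array([[0.0, -10.0],
                    [0.0, -10.0],
                    [0.0, -10.0],
                    [-10.0, 0.0]])
w_stack = stacking_weights(toy_lpd)
w_pbma = pseudo_bma_weights(toy_lpd)
```

On this toy input, stacking keeps both models, with weights of roughly 0.75 and 0.25, because each model covers observations the other predicts poorly. The pseudo-BMA weights concentrate almost entirely on model 0, since they depend only on the summed log scores.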
