Bayesian subset selection and variable importance for interpretable prediction and classification

Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often eschewed due to selection instability, computational bottlenecks, and a lack of post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model M, we elicit predictively competitive subsets using linear decision analysis. The approach is customizable for (local) prediction or classification and provides interpretable summaries of M. A key quantity is the acceptable family of subsets, which leverages the predictive distribution from M to identify subsets that offer nearly optimal prediction. The acceptable family spawns new (co-)variable importance metrics based on whether variables (co-)appear in all, some, or no acceptable subsets. Crucially, the linear coefficients for any subset inherit regularization and predictive uncertainty quantification via M. The proposed approach exhibits excellent prediction, interval estimation, and variable selection on simulated data, including a case with p = 400 > n. These tools are applied to a large education dataset with highly correlated covariates, where the acceptable family is especially useful. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and features highly competitive prediction with remarkable stability.

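The acceptable family admits a direct computational description: for each candidate subset, compute the optimal linear action by projecting the posterior predictive mean from M onto that subset's covariates, evaluate the resulting predictive loss against posterior predictive draws, and retain every subset whose loss falls within a small margin of the best. The sketch below illustrates this idea under simplifying assumptions: squared-error loss, exhaustive enumeration up to a fixed subset size, and an expected-loss margin in place of the probability-based (epsilon, eta) criterion; the function `acceptable_family` and its arguments are illustrative names, not the authors' implementation.

```python
import itertools
import numpy as np

def acceptable_family(X, y_pred_draws, max_size=3, eps=0.05):
    """Illustrative sketch: find subsets with near-optimal linear prediction.

    X            : (n, p) covariate matrix
    y_pred_draws : (S, n) posterior predictive draws from a Bayesian model M
    max_size     : largest subset size to enumerate (exhaustive, so keep small)
    eps          : margin; keep subsets within (1 + eps) of the best loss
    """
    n, p = X.shape
    y_bar = y_pred_draws.mean(axis=0)  # posterior predictive mean of y
    expected_loss = {}
    for k in range(1, max_size + 1):
        for subset in itertools.combinations(range(p), k):
            Xs = X[:, subset]
            # Optimal linear action: project the predictive mean onto span(Xs)
            beta, *_ = np.linalg.lstsq(Xs, y_bar, rcond=None)
            fit = Xs @ beta
            # Squared-error predictive loss, averaged over predictive draws
            losses = ((y_pred_draws - fit) ** 2).mean(axis=1)
            expected_loss[subset] = losses.mean()
    best = min(expected_loss.values())
    # Acceptable family: all subsets with loss within the margin of the best
    return [s for s, loss in expected_loss.items() if loss <= (1 + eps) * best]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=50)
    # Stand-in for posterior predictive draws from a fitted Bayesian model M
    draws = y + rng.normal(scale=0.1, size=(200, 50))
    print(acceptable_family(X, draws, max_size=2, eps=0.10))
```

Variable importance then follows by inspecting the returned family: a covariate that appears in every acceptable subset is essential for near-optimal prediction, one that appears in none is dispensable, and co-appearance patterns yield the (co-)variable importance metrics described above.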