Projective Inference in High-dimensional Problems: Prediction and Feature Selection

This paper discusses predictive inference and feature selection for generalized linear models with scarce but high-dimensional data. We argue that in many cases one can benefit from a decision-theoretically justified two-stage approach: first, construct a possibly non-sparse model that predicts well, and then find a minimal subset of features that characterizes those predictions. The model built in the first step is referred to as the \emph{reference model} and the operation in the second step as predictive \emph{projection}. The key characteristic of the approach is that it achieves an excellent tradeoff between sparsity and predictive accuracy; the gain comes from exploiting all available information, including the prior and the information carried by the left-out features. We review several methods that follow this principle and make novel methodological contributions. We present a new projection technique that unifies two existing techniques and is both accurate and fast to compute. We also propose a way of evaluating the feature selection process using fast leave-one-out cross-validation, which allows for easy and intuitive model size selection. Furthermore, we prove a theorem that helps to understand the conditions under which the projective approach can be beneficial. The benefits are illustrated with several simulated and real-world examples.
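To make the projection step concrete: the submodel's parameters are not fitted to the raw data but are chosen to minimize the Kullback-Leibler divergence from the reference model's predictive distribution to the submodel's. In the Gaussian observation model this minimization has a closed form: ordinary least squares of the reference fit on the selected features, with the projected noise variance inflated by whatever part of the reference fit the subset cannot reproduce. The following is a minimal Python sketch of that special case only, under simplifying assumptions of ours: the function names are hypothetical, `f_ref` stands for a single point summary of the reference model's fit (the full method projects each posterior draw separately), and greedy forward search is just one possible search heuristic.

```python
import numpy as np

def project_gaussian(X, f_ref, sigma2_ref, subset):
    """KL-project a Gaussian reference model onto a feature subset.

    Minimizing the KL divergence from the reference predictive
    distribution N(f_ref, sigma2_ref) to the submodel N(X_s beta, sigma2)
    reduces to least squares of the reference fit f_ref (not the raw
    data) on the selected columns; the projected noise variance absorbs
    the part of f_ref that the subset misses.
    """
    Xs = X[:, subset]
    beta, *_ = np.linalg.lstsq(Xs, f_ref, rcond=None)
    resid = f_ref - Xs @ beta
    sigma2 = sigma2_ref + np.mean(resid ** 2)
    return beta, sigma2

def forward_search(X, f_ref, sigma2_ref, max_size):
    """Greedily grow the subset, at each step adding the feature whose
    inclusion gives the smallest projected variance, i.e. the smallest
    KL divergence from the reference model."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_size):
        best = min(remaining, key=lambda j: project_gaussian(
            X, f_ref, sigma2_ref, selected + [j])[1])
        selected.append(best)
        remaining.remove(best)
    return selected
```

In practice one would then evaluate the projected submodels along the search path with fast (approximate) leave-one-out cross-validation, as the abstract proposes, and pick the smallest subset whose predictive performance is close to that of the reference model.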
