Bayesian Variable Selection in Linear Regression in One Pass for Large Datasets

Bayesian models are generally computed with Markov Chain Monte Carlo (MCMC) methods. The main disadvantage of MCMC methods is the large number of iterations they need to sample the posterior distributions of model parameters, especially for large datasets. Variable selection, in turn, remains a challenging problem because of its combinatorial search space, and Bayesian models are a promising solution to it. In this work, we study how to accelerate Bayesian model computation for variable selection in linear regression. We propose a fast Gibbs sampler, a widely used MCMC method, that incorporates several optimizations. We use a Zellner prior for the regression coefficients, an improper prior on the variance, and a conjugate Gaussian prior distribution, which together enable dataset summarization in one pass by exploiting an augmented set of sufficient statistics. Thereafter, the algorithm iterates in main memory. Sufficient statistics are indexed with a sparse binary vector to efficiently compute matrix projections based on the selected variables. The probabilities of discovered variable subsets, covering both selecting and discarding each variable, are stored in a hash table for fast retrieval in future iterations. We study how to integrate our algorithm into a Database Management System (DBMS), exploiting aggregate User-Defined Functions for parallel data summarization and stored procedures to manipulate matrices with arrays. An experimental evaluation with real datasets measures accuracy and time performance, comparing our DBMS-based algorithm with the R package. Our algorithm is shown to produce accurate results, scale linearly with dataset size, and run orders of magnitude faster than the R package.
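The one-pass summarization and the hash-table lookup described above can be illustrated with a minimal Python sketch. It is not the authors' DBMS implementation: the function names (`summarize`, `project`, `solve`, `rss`), the dense 0/1 tuple standing in for the sparse binary vector, and the use of the residual sum of squares as the cached per-subset quantity are all assumptions made for illustration. The sketch assumes each row is augmented as z = (1, x_1, ..., x_d, y), so the single matrix Q = sum of z z^T holds n, the linear sums, X'X, X'y and y'y.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Matrix = List[List[float]]
Row = Tuple[Sequence[float], float]


def summarize(rows: Iterable[Row]) -> Matrix:
    """Scan the data once and return Q = sum(z z^T), with z = (1, x, y)."""
    Q: Matrix = []
    for x, y in rows:
        z = [1.0, *x, y]
        if not Q:
            Q = [[0.0] * len(z) for _ in z]
        for i, zi in enumerate(z):
            for j, zj in enumerate(z):
                Q[i][j] += zi * zj
    return Q


def project(Q: Matrix, gamma: Sequence[int]):
    """Extract X'X, X'y and y'y for the intercept plus variables with gamma[k] == 1."""
    idx = [0] + [k + 1 for k, g in enumerate(gamma) if g]
    yi = len(Q) - 1  # the last row/column of Q belongs to y
    xtx = [[Q[i][j] for j in idx] for i in idx]
    xty = [Q[i][yi] for i in idx]
    return xtx, xty, Q[yi][yi]


def solve(A: Matrix, b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def rss(Q: Matrix, gamma: Sequence[int], cache: Dict[Tuple[int, ...], float]) -> float:
    """Residual sum of squares for a subset, memoized in a hash table keyed by gamma."""
    key = tuple(gamma)
    if key not in cache:
        xtx, xty, yty = project(Q, key)
        beta = solve(xtx, xty)
        # RSS = y'y - beta' X'y when beta solves the normal equations.
        cache[key] = yty - sum(b * v for b, v in zip(beta, xty))
    return cache[key]


# Hypothetical toy data: y depends only on the first variable.
data = [([1.0, 5.0], 3.0), ([2.0, 3.0], 5.0), ([3.0, 8.0], 7.0), ([4.0, 1.0], 9.0)]
Q_demo = summarize(data)  # the only pass over the rows
cache_demo: Dict[Tuple[int, ...], float] = {}
for g in [(0, 0), (1, 0), (1, 0), (1, 1)]:  # a repeated subset hits the cache
    print(g, rss(Q_demo, g, cache_demo))
```

Once `summarize` has run, every subset evaluation touches only the small matrix Q, never the original rows, which is what lets the sampler iterate in main memory. In a real sampler the cached value would be the subset's posterior probability rather than the residual sum of squares used here.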
