Bayesian Feature Selection with Strongly Regularizing Priors Maps to the Ising Model

Identifying small subsets of features that are relevant for prediction and classification tasks is a central problem in machine learning and statistics. Feature selection is especially important, and computationally difficult, for modern data sets in which the number of features can be comparable to or even exceed the number of samples. Here, we show that, under mild conditions, Bayesian feature selection takes a universal form and reduces to calculating the magnetizations of an Ising model. Our results exploit the observation that the evidence takes a universal form for strongly regularizing priors: priors that strongly affect the posterior probability even in the infinite-data limit. We derive explicit expressions for feature selection in generalized linear models, a large class of statistical techniques that includes linear and logistic regression. We illustrate the power of our approach by analyzing feature selection in a logistic regression-based classifier trained to distinguish between the letters B and D in the notMNIST data set.
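To make the mapping concrete, the sketch below illustrates the kind of computation the abstract describes: once feature selection has been cast as an Ising model, the posterior probability that feature i is included follows from the magnetization m_i. This is a minimal illustration, not the paper's derivation; the local fields `h` and couplings `J` are hypothetical placeholders standing in for the quantities that would be computed from the evidence of a generalized linear model, and the magnetizations are obtained with a simple damped naive mean-field fixed-point iteration.

```python
import numpy as np

def mean_field_magnetizations(h, J, n_iter=200, damping=0.5):
    """Naive mean-field magnetizations of an Ising model.

    h : (p,) local fields, J : (p, p) symmetric couplings (zero diagonal).
    Returns magnetizations m in [-1, 1]; the posterior probability that
    feature i is selected is then (1 + m[i]) / 2.
    """
    p = len(h)
    m = np.zeros(p)
    for _ in range(n_iter):
        # Damped fixed-point update m_i <- tanh(h_i + sum_j J_ij m_j)
        m_new = np.tanh(h + J @ m)
        m = damping * m + (1.0 - damping) * m_new
    return m

# Hypothetical example: in the paper's approach the fields and couplings
# would be derived from the model evidence; here small random values are
# used only to exercise the fixed-point iteration.
rng = np.random.default_rng(0)
p = 10
h = rng.normal(scale=0.5, size=p)
J = rng.normal(scale=0.1, size=(p, p))
J = (J + J.T) / 2.0
np.fill_diagonal(J, 0.0)

m = mean_field_magnetizations(h, J)
selection_prob = (1.0 + m) / 2.0
print(np.round(selection_prob, 3))
```

In practice, the interpretation is that features whose selection probability (1 + m_i)/2 is close to one are retained, so the combinatorial search over feature subsets is replaced by a single magnetization calculation.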
