In many real-world classification problems the input contains a large number of potentially irrelevant features. This paper proposes a new Bayesian framework for determining the relevance of input features. This approach extends one of the most successful Bayesian methods for feature selection and sparse learning, known as Automatic Relevance Determination (ARD). ARD finds the relevance of features by optimizing the model marginal likelihood, also known as the evidence. We show that this can lead to overfitting. To address this problem, we propose Predictive ARD, based on estimating the predictive performance of the classifier. While the actual leave-one-out predictive performance is generally very costly to compute, the expectation propagation (EP) algorithm proposed by Minka provides an estimate of this predictive performance as a side-effect of its iterations. We exploit this in our algorithm to perform feature selection, and to select data points in a sparse Bayesian kernel classifier. Moreover, we provide two other improvements to previous algorithms: we replace Laplace's approximation with the generally more accurate EP, and we incorporate the fast optimization algorithm proposed by Faul and Tipping. Our experiments show that our method based on the EP estimate of predictive performance is more accurate on test data than relevance determination by optimizing the evidence.
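For context, the classical evidence-optimizing ARD that the abstract critiques can be sketched in the simpler linear-regression setting, using MacKay-style fixed-point updates of per-feature prior precisions. This is a minimal illustration, not the paper's Predictive ARD or its classifier setup; the function name, the fixed noise precision `beta`, and the capping constant are illustrative assumptions:

```python
import numpy as np

def ard_regression(Phi, t, beta=100.0, n_iter=50, alpha_cap=1e6):
    """Evidence-based ARD for linear regression via MacKay's fixed-point
    updates. Each feature d has its own prior precision alpha[d]; when
    alpha[d] diverges (here: hits alpha_cap) the feature is effectively
    pruned. beta is a fixed, assumed-known noise precision."""
    N, D = Phi.shape
    alpha = np.ones(D)
    for _ in range(n_iter):
        # Posterior over weights w given the current alphas:
        #   Sigma = (diag(alpha) + beta * Phi^T Phi)^{-1}
        #   m     = beta * Sigma * Phi^T * t
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        m = beta * Sigma @ Phi.T @ t
        # "Well-determinedness" of each weight: gamma_d = 1 - alpha_d * Sigma_dd
        gamma = 1.0 - alpha * np.diag(Sigma)
        # MacKay's evidence fixed-point update: alpha_d <- gamma_d / m_d^2,
        # capped for numerical stability.
        alpha = np.minimum(gamma / (m ** 2 + 1e-12), alpha_cap)
    return m, alpha
```

Features whose precision grows very large are driven toward zero weight and are deemed irrelevant. The paper's point is that this evidence-driven selection can overfit; Predictive ARD instead ranks relevance by EP's leave-one-out estimate of predictive performance.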
[1] Ole Winther et al. Gaussian Processes for Classification: Mean-Field Algorithms. Neural Computation, 2000.
[2] Michael E. Tipping. The Relevance Vector Machine. NIPS, 1999.
[3] Tom Minka. Expectation Propagation for Approximate Bayesian Inference. UAI, 2001.
[4] Gunnar Rätsch et al. Soft Margins for AdaBoost. Machine Learning, 2001.
[5] Yi Li et al. Bayesian Automatic Relevance Determination Algorithms for Classifying Gene Expression Data. Bioinformatics, 2002.
[6] Isabelle Guyon et al. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res., 2003.
[7] Geoffrey E. Hinton et al. Bayesian Learning for Neural Networks, 1995.
[8] Michael E. Tipping et al. Analysis of Sparse Bayesian Learning. NIPS, 2001.
[9] Ralf Herbrich et al. Bayes Point Machines: Estimating the Bayes Point in Kernel Space, 1999.
[10] David J. C. MacKay. Bayesian Interpolation. Neural Computation, 1992.