Sparse Bayesian Learning and the Relevance Vector Machine

This paper introduces a general Bayesian framework for obtaining sparse solutions to regression and classification tasks utilising models linear in the parameters. Although this framework is fully general, we illustrate our approach with a particular specialisation that we denote the 'relevance vector machine' (RVM), a model of identical functional form to the popular and state-of-the-art 'support vector machine' (SVM). We demonstrate that by exploiting a probabilistic Bayesian learning framework, we can derive accurate prediction models which typically utilise dramatically fewer basis functions than a comparable SVM while offering a number of additional advantages. These include the benefits of probabilistic predictions, automatic estimation of 'nuisance' parameters, and the facility to utilise arbitrary basis functions (e.g. non-'Mercer' kernels). We detail the Bayesian framework and associated learning algorithm for the RVM, and give some illustrative examples of its application along with some comparative benchmarks. We offer some explanation for the exceptional degree of sparsity obtained, and discuss and demonstrate some of the advantageous features, and potential extensions, of Bayesian relevance learning.
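As a rough illustration of the kind of model the abstract describes, the sketch below fits a kernel regression model that is linear in its weights, gives each weight its own Gaussian prior precision, and prunes any basis function whose precision grows very large. The abstract does not state the learning algorithm, so everything here is an assumption: the evidence-style re-estimation rules, the Gaussian kernel and its width, the pruning threshold `alpha_max`, and all function names are illustrative choices, not the paper's implementation.

```python
import numpy as np

def gaussian_kernel(X, Z, width):
    # Kernel matrix K[i, j] = exp(-||x_i - z_j||^2 / width^2) (assumed kernel).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / width ** 2)

def design_matrix(X, centres, width):
    # One bias column plus one basis function per training point,
    # mirroring the functional form of an SVM.
    return np.hstack([np.ones((X.shape[0], 1)), gaussian_kernel(X, centres, width)])

def posterior(P, t, alpha, beta):
    # Gaussian posterior over weights for fixed hyperparameters.
    Sigma = np.linalg.inv(np.diag(alpha) + beta * P.T @ P)
    mu = beta * Sigma @ P.T @ t
    return mu, Sigma

def fit_rvm(X, t, width=3.0, n_iter=1000, alpha_max=1e9):
    N = X.shape[0]
    Phi = design_matrix(X, X, width)
    keep = np.arange(Phi.shape[1])      # indices of basis functions still active
    alpha = np.ones(Phi.shape[1])       # one prior precision per weight
    beta = 1.0 / np.var(t)              # noise precision, a 'nuisance' parameter
    for _ in range(n_iter):
        P = Phi[:, keep]
        mu, Sigma = posterior(P, t, alpha[keep], beta)
        # gamma_i measures how well the data determine weight i.
        gamma = np.clip(1.0 - alpha[keep] * np.diag(Sigma), 1e-12, 1.0)
        alpha[keep] = gamma / np.maximum(mu ** 2, 1e-30)
        resid = t - P @ mu
        beta = (N - gamma.sum()) / (resid @ resid)
        # Drop basis functions whose weights are pinned at zero by the prior.
        keep = keep[alpha[keep] < alpha_max]
    mu, Sigma = posterior(Phi[:, keep], t, alpha[keep], beta)
    return {"X": X, "width": width, "keep": keep, "mu": mu, "Sigma": Sigma, "beta": beta}

def predict_rvm(model, X_new):
    # Predictive mean and variance (noise plus weight uncertainty).
    P = design_matrix(X_new, model["X"], model["width"])[:, model["keep"]]
    mean = P @ model["mu"]
    var = 1.0 / model["beta"] + ((P @ model["Sigma"]) * P).sum(axis=1)
    return mean, var

# Usage on synthetic data: noisy samples of sin(x)/x.
rng = np.random.default_rng(0)
X = np.linspace(-10.0, 10.0, 100)[:, None]
truth = np.sinc(X[:, 0] / np.pi)
t = truth + 0.1 * rng.standard_normal(100)
model = fit_rvm(X, t)
mean, var = predict_rvm(model, X)
n_active = len(model["keep"])           # retained basis functions (incl. bias if kept)
```

The retained indices in `model["keep"]` play the role of the 'relevance vectors', and `var` is the kind of predictive uncertainty that a standard SVM does not supply. Because the basis functions are never required to form a positive-definite kernel matrix, `gaussian_kernel` could be swapped for another basis without changing the rest of the sketch.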
