Bayesian Learning in Reproducing Kernel Hilbert Spaces

Support Vector Machines find the hypothesis that corresponds to the centre of the largest hypersphere that can be placed inside version space, i.e. the space of all consistent hypotheses given a training set. The boundaries of version space touched by this hypersphere define the support vectors. An even more promising approach is to construct the hypothesis using the whole of version space. This is achieved by the Bayes point: the midpoint of the region of intersection of all hyperplanes bisecting version space into two volumes of equal magnitude. It is known that the centre of mass of version space approximates the Bayes point [30]. The centre of mass is estimated by averaging over the trajectory of a billiard in version space. We derive bounds on the generalisation error of Bayesian classifiers in terms of the volume ratio of version space and parameter space. This ratio serves as an effective VC dimension and greatly influences generalisation. We present experimental results indicating that Bayes Point Machines consistently outperform Support Vector Machines. Moreover, we show theoretically and experimentally how Bayes Point Machines can easily be extended to admit training errors.
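
The billiard construction described above can be sketched concretely. The following is a minimal sketch, not the authors' implementation: it assumes a linearly separable training set and works in primal weight space with a linear kernel, whereas the paper plays the billiard in the reproducing kernel Hilbert space induced by a kernel. Version space is taken to be the set of unit weight vectors w with y_i <x_i, w> > 0 for all training examples; the ball travels along great-circle arcs of the unit sphere and is reflected whenever it hits one of the bounding hyperplanes {w : y_i <x_i, w> = 0}; the time average of the trajectory estimates the centre of mass. The function name bayes_point_billiard, the perceptron initialisation, and the use of exact arc integrals for the time average are illustrative simplifications, not the update rule of the paper.

    import numpy as np

    def bayes_point_billiard(X, y, n_bounces=1000, seed=0):
        """Estimate the centre of mass of version space for labels y in {-1, +1}."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # Unit normals of the bounding hyperplanes, oriented so that
        # version space is {w : normals @ w > 0}.
        normals = (y[:, None] * X) / np.linalg.norm(X, axis=1, keepdims=True)

        # Starting point strictly inside version space, found by a plain
        # perceptron (assumes the data are linearly separable).
        w = np.zeros(d)
        for _ in range(100 * n):
            mistakes = np.flatnonzero(normals @ w <= 0)
            if mistakes.size == 0:
                break
            w = w + normals[rng.choice(mistakes)]
        w = w / np.linalg.norm(w)

        # Random initial direction tangent to the unit sphere at w.
        v = rng.standard_normal(d)
        v = v - (v @ w) * w
        v = v / np.linalg.norm(v)

        centre, total_time = np.zeros(d), 0.0
        for _ in range(n_bounces):
            # The geodesic w(t) = w cos(t) + v sin(t) hits hyperplane i when
            # a_i cos(t) + b_i sin(t) = 0; with a_i >= 0 the first
            # non-negative root is t_i = atan2(b_i, a_i) + pi/2.
            a = np.maximum(normals @ w, 0.0)  # clamp round-off drift
            b = normals @ v
            t = np.arctan2(b, a) + 0.5 * np.pi
            i = int(np.argmin(t))
            t_star = t[i]

            # Exact integral of w(t) over [0, t_star]: accumulates the
            # time average of the trajectory.
            centre += w * np.sin(t_star) + v * (1.0 - np.cos(t_star))
            total_time += t_star

            # Move to the collision point and reflect the velocity off the
            # hyperplane that was hit.
            w_new = w * np.cos(t_star) + v * np.sin(t_star)
            v_new = -w * np.sin(t_star) + v * np.cos(t_star)
            v_new = v_new - 2.0 * (v_new @ normals[i]) * normals[i]
            w = w_new / np.linalg.norm(w_new)
            v = v_new / np.linalg.norm(v_new)

        centre = centre / total_time
        return centre / np.linalg.norm(centre)

A hypothetical call would be w_bp = bayes_point_billiard(X_train, y_train), with test points classified by sign(X_test @ w_bp); the exact-arc time average is used here only because it keeps the estimator to a few lines.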

[1]  Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[2]  Peter L. Bartlett, et al. Neural Network Learning: Theoretical Foundations, 1999.

[3]  John Shawe-Taylor, et al. Structural Risk Minimization Over Data-Dependent Hierarchies, 1998, IEEE Trans. Inf. Theory.

[4]  David A. McAllester. Some PAC-Bayesian Theorems, 1998, COLT '98.

[5]  Nello Cristianini, et al. Bayesian Classifiers Are Large Margin Hyperplanes in a Hilbert Space, 1998, ICML.

[6]  Peter L. Bartlett, et al. The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network, 1998, IEEE Trans. Inf. Theory.

[7]  Vladimir Vapnik, et al. Statistical Learning Theory, 1998.

[8]  N. Cristianini, et al. Robust Bounds on Generalization from the Margin Distribution, 1998.

[9]  Yoav Freund, et al. Boosting the margin: A new explanation for the effectiveness of voting methods, 1997, ICML.

[10]  John Shawe-Taylor, et al. A PAC analysis of a Bayesian estimator, 1997, COLT '97.

[11]  Radford M. Neal. Markov Chain Monte Carlo Methods Based on 'Slicing' the Density Function, 1997.

[12]  P. Ruján. Playing Billiards in Version Space, 1997.

[13]  T. Watkin. Optimal Learning with a Neural Network, 1993.

[14]  Michael Kearns, et al. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension, 1992, IJCNN International Joint Conference on Neural Networks.

[15]  M. Opper, et al. Generalization performance of Bayes optimal classification algorithm for learning a perceptron, 1991, Physical Review Letters.

[16]  Robert E. Schapire, et al. Efficient distribution-free learning of probabilistic concepts, 1990, Proceedings of the 31st Annual Symposium on Foundations of Computer Science.

[17]  M. Opper, et al. On the ability of the optimal perceptron to generalise, 1990.

[18]  G. Wahba. Spline Models for Observational Data, 1990.

[19]  C. Micchelli. Interpolation of scattered data: Distance matrices and conditionally positive definite functions, 1986.

[20]  Leslie G. Valiant. A theory of the learnable, 1984, STOC '84.

[21]  Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities, 1971.

[22]  Frank Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, 1963.

[23]  C. Caramanis. What is ergodic theory, 1963.

[24]  B. Harshbarger. An Introduction to Probability Theory and its Applications, Volume I, 1958.

[25]  J. Mercer. Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations, 1909.