Large Scale Bayes Point Machines

The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set), also known as the Bayes point, exhibits excellent generalisation abilities. However, the billiard algorithm presented in [4] is restricted to small sample sizes because it requires O(m²) memory and O(N · m²) computational steps, where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which overcomes this algorithmic drawback. The method is algorithmically simple and easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples drawn from the posterior. Finally, rejecting test points on the basis of their (approximate) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error at a rejection rate of 10%.
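As a rough illustration of the approach described in the abstract (our own minimal sketch, not the authors' implementation, and omitting the kernelisation and multi-class extensions used in the paper), the following Python code approximates the Bayes point of a linearly separable data set by averaging several unit-length perceptron solutions, each trained on a different random permutation of the training patterns. The function names perceptron and bayes_point are illustrative only, not taken from the paper.

import numpy as np

def perceptron(X, y, max_epochs=100, rng=None):
    # Train a linear perceptron on (X, y) with labels in {-1, +1}.
    # Returns a weight vector consistent with the data if they are
    # linearly separable and max_epochs is large enough.
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(n):    # random presentation order
            if y[i] * (X[i] @ w) <= 0:  # misclassified or on the boundary
                w += y[i] * X[i]        # standard perceptron update
                mistakes += 1
        if mistakes == 0:               # consistent with the whole training set
            return w
    return w

def bayes_point(X, y, n_samples=20, seed=0):
    # Approximate the centre of mass of version space by averaging
    # n_samples normalised perceptron solutions, each obtained from a
    # different random permutation of the training patterns.
    rng = np.random.default_rng(seed)
    ws = [perceptron(X, y, rng=rng) for _ in range(n_samples)]
    ws = [w / np.linalg.norm(w) for w in ws]
    return np.mean(ws, axis=0)

# Toy usage on linearly separable 2-D data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
w_bp = bayes_point(X, y)
print("training accuracy:", np.mean(np.sign(X @ w_bp) == y))

Because each perceptron run is independent, the cost of this sketch grows linearly with the number of posterior samples, which mirrors the abstract's point that the computational complexity of BPMs can be tuned by varying the number of samples drawn from the posterior.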

[1] Ole Winther, et al. Gaussian Processes for Classification: Mean-Field Algorithms, 2000, Neural Computation.

[2] Bernhard Schölkopf, et al. Learning with Kernels, 2001.

[3] Radford M. Neal. Markov Chain Monte Carlo Methods Based on 'Slicing' the Density Function, 1997.

[4] John Platt, et al. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, 1999.

[5] Bernhard Schölkopf, et al. Computing the Bayes Kernel Classifier, 2000.

[6] David A. McAllester. Some PAC-Bayesian Theorems, 1998, COLT '98.

[7] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[8] Vladimir Vapnik, et al. Statistical Learning Theory, 1998.

[9] T. Watkin. Optimal Learning with a Neural Network, 1993.

[10] Vladimir Vapnik, et al. The Nature of Statistical Learning Theory, 1995.

[11] Christopher K. I. Williams. Prediction with Gaussian Processes: From Linear Regression to Linear Prediction and Beyond, 1999, Learning in Graphical Models.

[12] Alexander J. Smola, et al. Learning with Kernels, 1998.

[13] Thore Graepel, et al. The Kernel Gibbs Sampler, 2000, NIPS.

[14] Albert B. Novikoff. On Convergence Proofs for Perceptrons, 1963.

[15] John Shawe-Taylor, et al. Structural Risk Minimization Over Data-Dependent Hierarchies, 1998, IEEE Trans. Inf. Theory.

[16] P. Bartlett, et al. Probabilities for SV Machines, 2000.

[17] Thore Graepel, et al. A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs Work, 2000, NIPS.

[18] Colin Campbell, et al. Robust Bayes Point Machines, 2000, ESANN.