A PAC analysis of a Bayesian estimator

A Bayesian analysis of generalisation can place a prior distribution on the hypotheses and estimate the volume of hypothesis space that is consistent with the training data: the larger this volume, the greater the confidence in the resulting classifier. The key feature of such estimators is that they provide a posteriori estimates of generalisation based on properties of the hypothesis and the training data. This contrasts with a 'classical' PAC analysis, which provides only a priori (worst-case) bounds. Following results in [26] showing that data-sensitive analysis of generalisation in the PAC sense is possible, the paper uses these techniques to give the first PAC-style analysis of a Bayesian-inspired estimator of generalisation. The estimator concerned is the size of a ball that can be placed in the consistent region of parameter space. The ball gives a lower bound on the volume of parameter space consistent with the training set; the larger the ball, the better the resulting generalisation bound. In all cases the bounds assert good generalisation with high confidence, hence bounding the tail of the distribution of generalisation errors that might occur. The resulting bounds are independent of the complexity of the function class, though they depend linearly on the dimensionality of the parameter space.

Robert C. Williamson, Dept of Engineering, Australian National University, Canberra 0200, Australia (Bob.Williamson@anu.edu.au)
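To make the ball estimator concrete, the following is a minimal sketch, assuming a homogeneous linear classifier on unit-norm inputs; the function names and the margin-based construction are illustrative choices, not taken from the paper. It lower-bounds the radius of a ball, centred at a consistent weight vector, that fits inside the version space, and checks the claim empirically.

    import numpy as np

    def consistent_ball_radius(w, X, y):
        """Lower bound on the radius of a ball centred at the consistent
        weight vector w and contained in the version space (the region of
        parameter space consistent with the training set).

        For a homogeneous linear classifier sign(<w, x>) with unit-norm rows
        of X, any w' with ||w' - w|| < min_i y_i <w, x_i> still classifies
        every training example correctly, so the minimum margin is a valid
        ball radius.
        """
        margins = y * (X @ w)
        assert np.all(margins > 0), "w must be consistent with the training set"
        return margins.min()

    def monte_carlo_check(w, X, y, radius, trials=1000, seed=0):
        """Empirically check that perturbations of w within the ball stay consistent."""
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            d = rng.normal(size=w.shape)
            d *= rng.uniform(0, radius) / np.linalg.norm(d)
            if np.any(y * (X @ (w + d)) <= 0):
                return False
        return True

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        # Toy linearly separable data with unit-norm inputs (hypothetical example).
        w_true = np.array([1.0, -0.5, 0.25])
        X = rng.normal(size=(50, 3))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        y = np.sign(X @ w_true)

        w = w_true / np.linalg.norm(w_true)   # a consistent, unit-norm hypothesis
        r = consistent_ball_radius(w, X, y)
        print(f"ball radius (min margin): {r:.4f}")
        print("all sampled perturbations consistent:",
              monte_carlo_check(w, X, y, 0.999 * r))

For unit-norm inputs, any perturbation d with ||d|| below the minimum margin satisfies y_i <w + d, x_i> >= y_i <w, x_i> - ||d|| > 0, so every classifier in the ball remains consistent; the Monte Carlo check merely illustrates this fact and plays no role in the bound itself.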

[1] H. Jeffreys, et al. Theory of Probability, 1896.

[2] Stephen Spielman. A Refutation of the Neyman-Pearson Theory of Testing, 1973, The British Journal for the Philosophy of Science.

[3] D. Pierce. On Some Difficulties in a Frequency Theory of Inference, 1973.

[4] William Harper, et al. Foundations and Philosophy of Statistical Inference, 1976.

[5] C. Hooker, et al. Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, 1976.

[6] E. Jaynes, et al. Confidence Intervals vs Bayesian Intervals, 1976.

[7] Karl R. Popper, et al. Realism and the Aim of Science, 1983.

[8] Vladimir Vapnik, et al. Inductive Principles of the Search for Empirical Dependences (Methods Based on Weak Convergence of Probability Measures), 1989, COLT '89.

[9] Robert E. Schapire, et al. Efficient Distribution-Free Learning of Probabilistic Concepts, 1990, Proceedings of the 31st Annual Symposium on Foundations of Computer Science.

[10] David J. C. MacKay, et al. Bayesian Model Comparison and Backprop Nets, 1991, NIPS.

[11] Bernhard E. Boser, et al. A Training Algorithm for Optimal Margin Classifiers, 1992, COLT '92.

[12] G. Casella. Conditional Inference from Confidence Sets, 1992.

[13] Heekuck Oh, et al. Neural Networks for Pattern Recognition, 1993, Adv. Comput.

[14] Philip M. Long, et al. Fat-Shattering and the Learnability of Real-Valued Functions, 1994, COLT '94.

[15] M. Opper, et al. Perceptron Learning: The Largest Version Space, 1995.

[16] Manfred Opper. Perceptron Learning: The Largest Version Space, 1995.

[17] David MacKay. Probable Networks and Plausible Predictions: A Review of Practical Bayesian Methods for Supervised Neural Networks, 1995.

[18] Christopher M. Bishop, et al. Neural Networks for Pattern Recognition, 1995.

[19] Gábor Lugosi, et al. A Data-Dependent Skeleton Estimate for Learning, 1996, COLT '96.

[20] John Shawe-Taylor, et al. A Framework for Structural Risk Minimisation, 1996, COLT '96.

[21] Noga Alon, et al. Scale-Sensitive Dimensions, Uniform Convergence, and Learnability, 1997, JACM.

[22] Peter L. Bartlett, et al. The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network, 1998, IEEE Trans. Inf. Theory.

[23] John Shawe-Taylor, et al. Structural Risk Minimization Over Data-Dependent Hierarchies, 1998, IEEE Trans. Inf. Theory.

[24] Philip M. Long, et al. Prediction, Learning, Uniform Convergence, and Scale-Sensitive Dimensions, 1998, J. Comput. Syst. Sci.