Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension

In this paper we study a Bayesian, or average-case, model of concept learning with a twofold goal: to give more precise characterizations of learning-curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the learner, and to unite in a common framework the popular statistical physics and VC dimension theories of learning curves. To achieve this, we undertake a systematic investigation and comparison of two fundamental quantities in learning and information theory: the probability of an incorrect prediction for an optimal learning algorithm, and the Shannon information gain. This study leads to a new understanding of the sample complexity of learning in several existing models.
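
The following sketch illustrates the flavor of the relationship the abstract describes; the notation (p_t, H_2, d, m) is ours and the constants are only indicative, not the paper's exact statements. It assumes the noise-free Bayesian setting with a concept class of VC dimension d, where p_t denotes the posterior probability that the t-th label is 1.

% Sketch under the stated assumptions (our notation, not the paper's).
% Per-trial link: the Bayes-optimal predictor errs with probability
% min(p_t, 1-p_t), while observing the label yields an expected information
% gain of H_2(p_t) bits, and pointwise
\[
  \min(p_t, 1 - p_t) \;\le\; \tfrac{1}{2}\, H_2(p_t),
  \qquad H_2(p) = -p \log_2 p - (1-p)\log_2 (1-p).
\]
% Cumulatively, the expected information gain telescopes (chain rule) to the
% expected code length of the label sequence, which Sauer's lemma bounds by
% the log of the number of dichotomies, for m >= d:
\[
  \sum_{t=1}^{m} \mathbb{E}\!\left[H_2(p_t)\right]
  \;=\; \mathbb{E}\!\left[-\log_2 \Pr(y_1,\dots,y_m \mid x_1,\dots,x_m)\right]
  \;\le\; d \log_2\!\frac{em}{d}.
\]
% Combining the two displays, the average per-trial error of the Bayes-optimal
% predictor over m trials is O((d/m) log(m/d)), the type of learning-curve
% bound the abstract refers to:
\[
  \frac{1}{m} \sum_{t=1}^{m} \mathbb{E}\!\left[\min(p_t, 1-p_t)\right]
  \;\le\; \frac{d}{2m} \log_2\!\frac{em}{d}.
\]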
