HOW WELL DO BAYES METHODS WORK FOR ON-LINE PREDICTION OF {±1} VALUES?

We consider sequential classification and regression problems in which {±1}-labeled instances arrive on-line, one at a time, and for each new instance, before seeing the label, the learning system must either predict the label or estimate the probability that the label is +1. We examine the performance of the Bayes method for this task, measured by the total number of mistakes for the classification problem and by the total log loss (or information gain) for the regression problem. Our results are obtained by comparing the performance of the Bayes method to that of a hypothetical "omniscient scientist" who is able to use extra information about the labeling process that would not be available under the standard learning protocol. They show that in many cases the Bayes method performs only slightly worse than the omniscient scientist. These results generalize previous results of Haussler, Kearns, and Schapire, and of Opper and Haussler.
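
To make the protocol concrete, here is a minimal Python sketch (not from the paper) of the on-line Bayes predictor over a finite class of probabilistic hypotheses, tracking both performance measures discussed above: the number of mistakes made by thresholding the predictive probability at 1/2, and the cumulative log loss of the predicted probabilities. The function name bayes_online, the finite hypothesis class, and the coin-bias example are illustrative assumptions; the setting treated in the paper is more general.

import math

def bayes_online(hypotheses, prior, instances, labels):
    """On-line Bayes prediction over a finite hypothesis class.

    hypotheses : list of functions h(x) giving the probability that x's label is +1
    prior      : prior weights over the hypotheses (nonnegative, summing to 1)
    instances  : sequence of instances x_1, ..., x_T
    labels     : sequence of observed labels, each in {-1, +1}
    Returns (total mistakes, total log loss in bits).
    """
    posterior = list(prior)
    mistakes, log_loss = 0, 0.0
    for x, y in zip(instances, labels):
        # Posterior predictive probability that the next label is +1.
        p_plus = sum(w * h(x) for w, h in zip(posterior, hypotheses))
        # Classification: predict the more probable label; count a mistake if wrong.
        if (1 if p_plus >= 0.5 else -1) != y:
            mistakes += 1
        # Regression: charge the log loss of the probability assigned to the true label.
        p_y = p_plus if y == 1 else 1.0 - p_plus
        log_loss += -math.log2(max(p_y, 1e-12))
        # Bayes update: reweight each hypothesis by the likelihood of the observed label.
        likelihoods = [h(x) if y == 1 else 1.0 - h(x) for h in hypotheses]
        z = max(sum(w * l for w, l in zip(posterior, likelihoods)), 1e-300)
        posterior = [w * l / z for w, l in zip(posterior, likelihoods)]
    return mistakes, log_loss

# Example: two "biased coin" hypotheses that ignore the instance entirely.
hs = [lambda x: 0.9, lambda x: 0.1]
print(bayes_online(hs, [0.5, 0.5], range(20), [1] * 20))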

[1] N. Sauer. On the Density of Families of Sets. J. Comb. Theory, Ser. A, 1972.

[2] G. Schwarz. Estimating the Dimension of a Model, 1978.

[3] J. Rissanen. Stochastic Complexity and Modeling, 1986.

[4] T. M. Cover and B. Gopinath, editors. Open Problems in Communication and Computation. Springer-Verlag, New York, 1987.

[5] A. De Santis, G. Markowsky, and M. N. Wegman. Learning probabilistic prediction functions. In Proceedings of the 29th Annual Symposium on Foundations of Computer Science, 1988.

[6] D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting {0,1}-functions on randomly drawn points. In COLT '88, 1988.

[7] D. Haussler, M. Kearns, N. Littlestone, and M. K. Warmuth. Equivalence of models for polynomial learnability. In COLT '88, 1988.

[8] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 1989.

[9] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, 1989.

[10] B. S. Clarke and A. R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 1990.

[11] V. Vovk. Aggregating strategies. In COLT '90, 1990.

[12] N. Littlestone. Mistake bounds and logarithmic linear-threshold learning algorithms, 1990.

[13] D. Haussler, M. Kearns, and R. E. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In COLT '91, 1991.

[14] M. Opper and D. Haussler. Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Physical Review Letters, 1991.

[15] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.

[16] K. Yamanishi. A loss bound model for on-line stochastic prediction strategies. In COLT '91, 1991.

[17] M. Opper and D. Haussler. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In COLT '91, 1991.

[18] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 1992.

[19] K. Yamanishi. A statistical approach to computational learning theory, 1992.

[20] A. P. Dawid. Prequential data analysis, 1992.

[21] N. Merhav and M. Feder. Universal sequential learning and decision from individual data sequences. In COLT '92, 1992.

[22] M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE Transactions on Information Theory, 1992.

[23] S. Amari. A universal theorem on learning curves. Neural Networks, 1993.

[24] M. Feder and N. Merhav. Relations between entropy and error probability. IEEE Transactions on Information Theory, 1994.