The Weighted Majority Algorithm

We study the construction of prediction algorithms in a setting where a learner faces a sequence of trials, must make a prediction in each, and aims to make few mistakes. We are interested in the case where the learner has reason to believe that one of some pool of known algorithms will perform well, but does not know which one. We introduce a simple and effective method, based on weighted voting, for constructing a compound algorithm in such a circumstance. We call this method the Weighted Majority Algorithm. We show that this algorithm is robust in the presence of errors in the data. We discuss various versions of the Weighted Majority Algorithm and prove mistake bounds for them that are closely related to the mistake bounds of the best algorithms of the pool. For example, given a sequence of trials, if there is an algorithm in the pool A that makes at most m mistakes, then the Weighted Majority Algorithm will make at most c(log |A| + m) mistakes on that sequence, where c is a fixed constant.
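To make the weighted-voting idea concrete, the following is a minimal sketch of one deterministic variant for binary predictions, not a transcription of the paper's pseudocode. The function names `weighted_majority` and `mistake_bound`, the data layout, the default penalty factor `beta = 0.5`, and the choice to update weights on every trial are all assumptions made for illustration. The helper evaluates the standard bound (log |A| + m log(1/beta)) / log(2/(1 + beta)), which has the form c(log |A| + m) stated above.

```python
import math


def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Illustrative weighted-majority voting over binary (0/1) predictions.

    expert_predictions[t][i] is the prediction of pool member i on trial t,
    and outcomes[t] is the correct label for trial t (hypothetical layout).
    Returns (number of mistakes made by the vote, final weights).
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    n = len(expert_predictions[0])
    weights = [1.0] * n  # every pool member starts with equal weight
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        # Total weight voting for each of the two labels.
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0  # ties go to 1 (assumption)
        if guess != y:
            mistakes += 1
        # Assumption: penalize every member that erred, on every trial.
        weights = [w * beta if p != y else w for w, p in zip(weights, preds)]
    return mistakes, weights


def mistake_bound(pool_size, best_mistakes, beta=0.5):
    """Upper bound on the vote's mistakes, given the best member's mistakes."""
    return (math.log(pool_size) + best_mistakes * math.log(1.0 / beta)) / math.log(
        2.0 / (1.0 + beta)
    )
```

The bound follows because each mistake of the vote shrinks the total weight by a factor of at least (1 + beta)/2, while the best pool member alone keeps weight at least beta^m. Because only the relative sizes of the weights matter, a member that errs often is quickly outvoted without ever being removed from the pool.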
