Using and combining predictors that specialize

We study online learning algorithms that predict by combining the predictions of several subordinate prediction algorithms, sometimes called "experts." These simple algorithms belong to the multiplicative-weights family. Their performance degrades only logarithmically with the number of experts, making them particularly useful in applications where the number of experts is very large. However, in applications such as text categorization, it is often natural for some of the experts to abstain from making predictions on some of the instances. We show how to transform algorithms that assume all experts are always awake into algorithms that do not require this assumption, and we show how to derive the corresponding loss bounds. Our method is very general and can be applied to a large family of online learning algorithms. We also give applications to various prediction models, including decision graphs and "switching" experts.
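To make the abstaining-experts ("specialists") setting concrete, here is a minimal sketch of a multiplicative-weights update in which only awake experts predict and are reweighted, with the awake pool rescaled to keep its total mass fixed. The function name `specialists_mw`, the trial format, the absolute loss, and the learning rate `eta` are illustrative assumptions, not details taken from the paper.

```python
import math

def specialists_mw(num_experts, trials, eta=0.5):
    """Sketch of multiplicative weights with 'sleeping' experts.

    trials: iterable of (awake, predictions, outcome), where awake is the
    set of expert indices predicting this round, predictions maps those
    indices to values in [0, 1], and outcome is the true label in [0, 1].
    Absolute loss is used purely for illustration.
    """
    w = [1.0] * num_experts                 # one weight per expert
    total_loss = 0.0

    for awake, predictions, outcome in trials:
        awake = list(awake)
        z = sum(w[i] for i in awake)        # mass of awake experts only
        # Master prediction: weighted average over awake experts.
        y_hat = sum(w[i] * predictions[i] for i in awake) / z
        total_loss += abs(y_hat - outcome)

        # Exponential update restricted to awake experts ...
        new = {i: w[i] * math.exp(-eta * abs(predictions[i] - outcome))
               for i in awake}
        # ... rescaled so the awake pool keeps the same total mass,
        # leaving sleeping experts' weights completely untouched.
        scale = z / sum(new.values())
        for i in awake:
            w[i] = new[i] * scale

    return total_loss, w

# Example: expert 0 always predicts 1; expert 1 abstains on the second round.
trials = [({0, 1}, {0: 1.0, 1: 0.0}, 1.0),
          ({0},    {0: 1.0},         1.0)]
loss, weights = specialists_mw(2, trials)
```

The rescaling step is what allows a standard always-awake analysis to carry over: from the point of view of the sleeping experts, nothing changes on rounds where they abstain.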
