Tight worst-case loss bounds for predicting with expert advice

We consider on-line algorithms for predicting binary or continuous-valued outcomes, when the algorithm has available the predictions made by N experts. For a sequence of trials, we compute total losses for both the algorithm and the experts under a loss function. At the end of the trial sequence, we compare the total loss of the algorithm to the total loss of the best expert, i.e., the expert with the least loss on the particular trial sequence. We show that for a large class of loss functions, with binary outcomes the total loss of the algorithm proposed by Vovk exceeds the total loss of the best expert by at most c ln N, where c is a constant determined by the loss function. This upper bound does not depend on any assumptions about how the experts' predictions or the outcomes are generated, and the trial sequence can be arbitrarily long. We give a straightforward method for finding the correct value of c and show by a lower bound that for this value of c, the upper bound is asymptotically tight. The lower bound is based on a probabilistic adversary argument. The class of loss functions for which the c ln N upper bound holds includes the square loss, the logarithmic loss, and the Hellinger loss. We also consider another class of loss functions, including the absolute loss, for which we have an Omega((l log N)^(1/2)) lower bound, where l is the number of trials. We show that for the square and logarithmic loss functions, Vovk's algorithm achieves the same worst-case upper bounds with continuous-valued outcomes as with binary outcomes. For the absolute loss, we show how bounds earlier achieved for binary outcomes can be achieved with continuous-valued outcomes using a slightly more complicated algorithm.
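The c ln N guarantee can be illustrated for the simplest case, the logarithmic loss with binary outcomes. The sketch below is not the general algorithm from the paper: it assumes c = 1 for the logarithmic loss and uses a plain weighted average of the experts' predictions, with each weight multiplied by exp(-loss) after every trial. The function names, the random experts, and the random outcome sequence are all hypothetical choices made for the illustration.

```python
import math
import random


def log_loss(y, p):
    """Logarithmic loss of a prediction p in (0, 1) for a binary outcome y."""
    return -math.log(p if y == 1 else 1.0 - p)


def run_mixture(expert_preds, outcomes):
    """Predict with a weighted average of experts under logarithmic loss.

    expert_preds[t][i] is expert i's prediction at trial t (assumed in (0, 1)).
    Returns (total loss of the forecaster, list of total losses per expert).
    """
    n = len(expert_preds[0])
    log_w = [0.0] * n            # log-weights; all experts start equal
    alg_loss = 0.0
    expert_loss = [0.0] * n
    for preds, y in zip(expert_preds, outcomes):
        # Normalize in log space to avoid underflow on long sequences.
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        total = sum(w)
        p = sum(wi * pi for wi, pi in zip(w, preds)) / total
        alg_loss += log_loss(y, p)
        for i, pi in enumerate(preds):
            li = log_loss(y, pi)
            expert_loss[i] += li
            log_w[i] -= li       # weight update: w_i <- w_i * exp(-loss_i)
    return alg_loss, expert_loss


# Hypothetical data: 8 experts making arbitrary predictions on 200 trials.
random.seed(0)
N, T = 8, 200
preds = [[random.uniform(0.05, 0.95) for _ in range(N)] for _ in range(T)]
ys = [random.randint(0, 1) for _ in range(T)]

alg, experts = run_mixture(preds, ys)
regret = alg - min(experts)
print(f"regret = {regret:.4f}, ln N = {math.log(N):.4f}")
```

Under these assumptions the printed regret never exceeds ln N, however the predictions and outcomes are chosen, because the forecaster's total loss equals the negative logarithm of the average of exp(-loss) over the experts. Other loss functions in the class need a different constant c and, in general, a more careful rule for turning the weights into a prediction.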
