Sequential Prediction of Individual Sequences Under General Loss Functions

We consider adaptive sequential prediction of arbitrary binary sequences when performance is evaluated using a general loss function. The goal is to predict each individual sequence nearly as well as the best prediction strategy in a given comparison class of (possibly adaptive) prediction strategies, called experts. By allowing a general loss function, we generalize previous work on universal prediction, forecasting, and data compression; here, however, we restrict ourselves to the case where the comparison class is finite. For a given sequence, we define the regret as the total loss on the entire sequence suffered by the adaptive sequential predictor, minus the total loss suffered by the predictor in the comparison class that performs best on that particular sequence. We show that for a large class of loss functions, the minimax regret is either Θ(log N) or Ω(√(ℓ log N)), depending on the loss function, where N is the number of predictors in the comparison class and ℓ is the length of the sequence to be predicted. The former case was shown previously by Vovk (1990) [8]; we give a simplified analysis with an explicit closed form for the constant in the minimax regret formula, together with a probabilistic argument showing that this constant is the best possible. Some weak regularity conditions are imposed on the loss function in obtaining these results. We also extend our analysis to the case of predicting arbitrary sequences that take real values in the interval [0,1].
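
To make the expert-aggregation setting concrete, here is a minimal sketch, in Python, of the standard exponentially weighted average forecaster over N experts under absolute loss. The function names and the absolute-loss choice are ours for illustration only; the paper analyzes general loss functions and gives no code. With the learning rate tuned as η = √(8 ln N / ℓ), this forecaster is known to guarantee regret at most √((ℓ/2) ln N) against the best expert, matching the Ω(√(ℓ log N)) regime of the abstract up to a constant; for losses such as log loss, Vovk's aggregating algorithm [8] instead attains the Θ(log N) regime.

import math

def exp_weighted_forecaster(expert_preds, outcomes, eta):
    """Exponentially weighted average forecaster (illustrative sketch).

    expert_preds: list of per-expert prediction lists, entries in [0, 1].
    outcomes:     binary sequence to be predicted, entries in {0, 1}.
    eta:          learning rate; eta = sqrt(8 * ln(N) / T) yields regret
                  at most sqrt((T / 2) * ln(N)) under absolute loss.
    Returns the total losses of the forecaster and of the best expert.
    """
    n, T = len(expert_preds), len(outcomes)
    weights = [1.0] * n          # one weight per expert, initially uniform
    expert_loss = [0.0] * n
    master_loss = 0.0
    for t in range(T):
        w_sum = sum(weights)
        # Predict with the weighted average of the experts' predictions.
        p = sum(w * expert_preds[i][t] for i, w in enumerate(weights)) / w_sum
        y = outcomes[t]
        master_loss += abs(p - y)
        # Multiplicatively down-weight each expert by its incurred loss.
        for i in range(n):
            loss = abs(expert_preds[i][t] - y)
            expert_loss[i] += loss
            weights[i] *= math.exp(-eta * loss)
    return master_loss, min(expert_loss)

For example, with N = 2 experts that always predict 0 and 1 respectively, the forecaster's cumulative absolute loss exceeds that of the better constant predictor by at most √((ℓ/2) ln 2) on any binary sequence of length ℓ.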

[1]  D. Blackwell. An analog of the minimax theorem for vector payoffs. 1956.

[2]  Thomas M. Cover, et al. Behavior of sequential predictors of binary sequences. 1965.

[3]  J. D. T. Oliveira, et al. The Asymptotic Theory of Extreme Order Statistics. 1979.

[4]  Jorma Rissanen. Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory, 1984.

[5]  J. Mycielski. A learning theorem for linear operators. 1988.

[6]  Alfredo De Santis, et al. Learning probabilistic prediction functions. 29th Annual Symposium on Foundations of Computer Science, 1988.

[7]  Manfred K. Warmuth, et al. The weighted majority algorithm. 30th Annual Symposium on Foundations of Computer Science, 1989.

[8]  Vladimir Vovk. Aggregating strategies. COLT '90, 1990.

[9]  Dean Phillips Foster. Prediction in the worst case. 1991.

[10]  P. Krishnan, et al. Optimal prefetching via data compression. 32nd Annual Symposium on Foundations of Computer Science, 1991.

[11]  Philip M. Long, et al. On-line learning of linear functions. STOC '91, 1991.

[12]  David Haussler, et al. How well do Bayes methods work for on-line prediction of {±1} values? 1992.

[13]  Neri Merhav, et al. Universal sequential learning and decision from individual data sequences. COLT '92, 1992.

[14]  Neri Merhav, et al. Universal prediction of individual sequences. IEEE Trans. Inf. Theory, 1992.

[15]  Abraham Lempel, et al. A sequential algorithm for the universal coding of finite memory sources. IEEE Trans. Inf. Theory, 1992.

[16]  Vladimir Vovk. Universal forecasting algorithms. Inf. Comput., 1992.

[17]  David Haussler, et al. How to use expert advice. STOC, 1993.

[18]  Philip M. Long, et al. Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. 1993.

[19]  Thomas H. Chung, et al. Approximate methods for sequential decision making using expert advice. COLT '94, 1994.

[20]  Neri Merhav, et al. Optimal sequential probability assignment for individual sequences. IEEE Trans. Inf. Theory, 1994.

[21]  Nicolò Cesa-Bianchi, et al. Gambling in a rigged casino: the adversarial multi-armed bandit problem. 36th Annual Symposium on Foundations of Computer Science, 1995.

[22]  Manfred K. Warmuth, et al. Additive versus exponentiated gradient updates for linear prediction. STOC '95, 1995.

[23]  Vladimir Vovk. A game of prediction with expert advice. COLT '95, 1995.

[24]  Erik Ordentlich, et al. Universal portfolios with side information. IEEE Trans. Inf. Theory, 1996.

[25]  Philip M. Long, et al. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Trans. Neural Networks, 1996.

[26]  Yoav Freund. Predicting a binary sequence almost as well as the optimal biased coin. COLT '96, 1996.

[27]  Yoram Singer, et al. On-line portfolio selection using multiplicative updates. ICML, 1996.

[28]  Nicolò Cesa-Bianchi, et al. On Bayes methods for on-line Boolean prediction. COLT '96, 1996.

[29]  Jorma Rissanen. Fisher information and stochastic complexity. IEEE Trans. Inf. Theory, 1996.

[30]  Manfred K. Warmuth, et al. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput., 1997.

[31]  Y. Shtarkov. Fuzzy estimation of unknown source model for universal coding. 1998 Information Theory Workshop, 1998.

[32]  Jorma Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific Series in Computer Science, 1989.

[33]  Dianne P. O'Leary. The mathematics of information coding, extraction, and distribution. 1999.

[34]  D. Haussler, et al. Worst case prediction over sequences under log loss. 1999.