Regret bounds for prediction problems

We present a unified framework for reasoning about worst-case regret bounds for learning algorithms. The framework is based on the theory of duality of convex functions. It brings together results from computational learning theory and Bayesian statistics, allowing us to derive new proofs of known theorems, new theorems about known algorithms, and new algorithms.

(This research was sponsored in part by the DARPA HPKB program under contract F30602-97-1-0215.)

1 The inference problem

We are interested in the following kind of inference problem: on each time step $t = 1, \ldots, T$ we must choose a prediction vector $w_t$ from a set of allowable predictions $W$. The interpretation of $w_t$ depends on the details of the problem, but for example $w_t$ might be our guess at the mean of a sequence of numbers or at the coefficients of a linear regression. Then the loss function $l_t(w)$ is revealed to us, and we are penalized $l_t(w_t)$. These penalties are additive, so our overall goal is to minimize $\sum_{t=1}^T l_t(w_t)$. Our choice of $w_t$ may depend on $l_1, \ldots, l_{t-1}$, and possibly on some additional prior information, but it may not depend on $l_t, \ldots, l_T$.

Many well-known inference problems, such as linear regression and estimation of mixture coefficients, are special cases of this one. To express one of these specific problems as an instance of our general inference problem, we will usually interpret the loss function $l_t$ as encoding both a training example and a criterion to be minimized: the location of the set of minima of $l_t$ encodes the training example, while the shape of $l_t$ encodes the cost of deviations in each direction. This double role for $l_t$ means that the loss function will usually change from step to step, even if we are always trying to minimize the same kind of errors.

For example, if we wanted to estimate the mean of a population of numbers from a sample $z_1, z_2, \ldots$, then $l_t(w)$ might be $(w - z_t)^2$. This choice of $l_t$ encodes both the current training point $z_t$ and the fact that we are minimizing squared error. (See Figure 1 for more detail.) Or, if we were interested in a linear regression of $y_t$ on $x_t$, $l_t(w)$ might be $(y_t - w \cdot x_t)^2$. This choice encodes both the current example $(x_t, y_t)$ and the fact that we want to minimize the squared prediction error. Or, if we were trying to solve a mixture estimation problem, $l_t(w)$ might be $-\ln(w \cdot p_t)$, where $w$ is the vector of mixture proportions and $p_{t,i}$ is the probability of the current training point under the $i$th model. (Here and below, the notation $p_{t,i}$ stands for the $i$th component of the vector $p_t$.) This choice of loss function encodes properties of the current example as well as the fact that we want to maximize log-likelihood.

We want to develop an algorithm for choosing a sequence of predictions $w_t$ so as to minimize our total loss $\sum_{t=1}^T l_t(w_t)$, even if the sequence of loss functions $l_t$ is chosen by an adversary. Unfortunately this problem is impossible without further assumptions: for example, the adversary could choose loss functions with corners or discontinuities and make the losses of two predictions $v_t$ and $w_t$ arbitrarily different even if $v_t$ and $w_t$ were close together. So, we will make two basic simplifications. The first is that we will place restrictions on the form of the functions $l_t$ that the adversary may choose. The chief restrictions will be that $l_t$ is convex and that a measure of the amount of information contained in $l_t$ does not increase too quickly from trial to trial. The second simplification is that we will seek a relative loss bound rather than an absolute one. That is, we will define a comparison class $U$ of predictions, and we will seek to minimize our regret $\sum_{t=1}^T (l_t(w_t) - l_t(u))$ versus the best predictor $u \in U$. (Often we will take $U = W$, so that we are comparing our predictions to the best constant prediction. Sometimes, though, we will need to take $U \subset W$ in order to prove a bound.) Since $u$ can be chosen post hoc, with knowledge of the loss functions $l_t$, such a regret bound is a strong statement.
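As an illustration, the following Python sketch spells out the three loss encodings above and the regret just defined. The function names (mean_loss, regression_loss, mixture_loss, regret) are ours, chosen for this example; they are not part of the framework itself.

```python
import math

def mean_loss(z):
    # Mean estimation: l_t(w) = (w - z_t)^2.  The minimum sits at the
    # training point z_t; the quadratic shape encodes squared error.
    return lambda w: (w - z) ** 2

def regression_loss(x, y):
    # Linear regression: l_t(w) = (y_t - w . x_t)^2.
    return lambda w: (y - sum(wi * xi for wi, xi in zip(w, x))) ** 2

def mixture_loss(p):
    # Mixture estimation: l_t(w) = -ln(w . p_t), where p[i] is the
    # likelihood of the current training point under the i-th model.
    return lambda w: -math.log(sum(wi * pi for wi, pi in zip(w, p)))

def regret(predictions, losses, comparison_class):
    # Total loss of our predictions minus that of the best fixed u in U,
    # with u chosen post hoc, after all the losses are known.
    ours = sum(l(w) for w, l in zip(predictions, losses))
    best = min(sum(l(u) for l in losses) for u in comparison_class)
    return ours - best

# Squared-error losses for the sample 4, 5, 3, 8 used in Figure 1, with a
# coarse grid standing in for U = W:
losses = [mean_loss(z) for z in (4, 5, 3, 8)]
print(regret([0, 2, 3, 3], losses, [u / 100 for u in range(1001)]))
# -> 36.0.  Figure 1's "Ttl regret" of 16 also charges u for the prior
#    loss l_0(u) = u^2, which this bare definition omits.
```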
The focus on regret instead of just loss is the chief place where our results differ from traditional statistical estimation theory. It is what allows us to handle sequences of loss functions that are too difficult to predict: our theorems will still hold, but since there will be no comparison $u$ that has small loss, the theorems will not tell us much about our total loss $\sum_{t=1}^T l_t(w_t)$.

Trial t                    0     1        2        3        4
Prediction w_t             —     0        2        3        3
Training example z_t       —     4        5        3        8
Error type                 —     Squared  Squared  Squared  Squared
Loss function l_t(w)       w^2   (w-4)^2  (w-5)^2  (w-3)^2  (w-8)^2
Loss of w_t                —     16       9        0        25
Ttl loss of w_1 ... w_t    0     16       25       25       50
Best constant u            4
Loss of u                  16    0        1        1        16
Ttl loss of u              16    16       17       18       34
Ttl regret                 -16   0        8        7        16

Figure 1: An example of the MAP algorithm in action, trying to minimize the sum of squared errors. The prediction at trial t is the mean of all examples up to trial t - 1, while the comparison vector is the mean of all examples.

Surprisingly, with only weak restrictions on $l_t$ and $u$, we will be able to prove bounds that are similar to the best possible average-case bounds (that is, bounds where $l_t$ is chosen by some fixed probability law). Our theorems will unify results from classical statistics (inference in exponential families and generalized linear models) with results from computational learning theory (weighted majority, the aggregating algorithm, exponentiated gradient).

This regret bound framework has been studied before, in [LW92, KW97, KW96, Vov90, CBFH95] among others. Also, some of our results are similar to results from classical statistics, such as the Cramér-Rao variance bound [SO91]. Our theorems are more general than each of these previous results in at least one of the following ways. First, they apply to more general classes of convex loss functions, including non-differentiable ones. Second, they apply to both online (i.e., bounded computation per example) and offline (unbounded computation) algorithms. Third, they apply to all sequences of loss functions, not just on average. Finally, they apply at all time steps, not just asymptotically. Our theorems are also less general than traditional statistical results in some ways. For example, while the Cramér-Rao bound requires differentiability of the loss functions, it does not require global convexity, just local convexity.

All of our theorems will concern variations on the following simple and intuitively appealing algorithm, which takes as input the loss functions $l_1, \ldots, l_{t-1}$ observed on previous trials, plus one additional loss function $l_0$ which encodes our prior knowledge before the first trial.

MAP ALGORITHM: Predict any $w_t$ which minimizes $\sum_{i=0}^{t-1} l_i(w)$.
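As a check on this rule, here is a minimal sketch that runs it on the data of Figure 1. It relies on the fact that, for the squared-error losses above, the minimizer of $\sum_{i=0}^{t-1} l_i(w)$ is the mean of a phantom prior example at 0 (encoding $l_0(w) = w^2$) and the examples seen so far; the variable names are ours, not the paper's.

```python
# MAP on the data of Figure 1.  The prior l_0(w) = w^2 behaves like a
# phantom training example at 0, so the MAP prediction is a running mean.

examples = [4, 5, 3, 8]
seen = [0.0]                        # phantom example encoding l_0
total = 0.0
for z in examples:
    w = sum(seen) / len(seen)       # w_t minimizes l_0 + l_1 + ... + l_{t-1}
    total += (w - z) ** 2           # suffer l_t(w_t)
    seen.append(z)
print(total)                        # 50.0 = 16 + 9 + 0 + 25

u = sum(seen) / len(seen)           # best constant: mean of 0, 4, 5, 3, 8 = 4
best = sum((u - z) ** 2 for z in seen)   # 16 + 0 + 1 + 1 + 16 = 34
print(total - best)                 # total regret 16.0, as in Figure 1
```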