Tracking the Best Linear Predictor

In most on-line learning research the total on-line loss of the algorithm is compared to the total loss of the best off-line predictor u from a comparison class of predictors. We call such bounds static bounds. The interesting feature of these bounds is that they hold for an arbitrary sequence of examples. Recently some work has been done where the predictor ut at each trial t is allowed to change with time, and the total on-line loss of the algorithm is compared to the sum of the losses of ut at each trial plus the total "cost" for shifting to successive predictors. This is to model situations in which the examples change over time, and different predictors from the comparison class are best for different segments of the sequence of examples. We call such bounds shifting bounds. They hold for arbitrary sequences of examples and arbitrary sequences of predictors.Naturally shifting bounds are much harder to prove. The only known bounds are for the case when the comparison class consists of a sequences of experts or boolean disjunctions. In this paper we develop the methodology for lifting known static bounds to the shifting case. In particular we obtain bounds when the comparison class consists of linear neurons (linear combinations of experts). Our essential technique is to project the hypothesis of the static algorithm at the end of each trial into a suitably chosen convex region. This keeps the hypothesis of the algorithm well-behaved and the static bounds can be converted to shifting bounds.

[1]  N. Aronszajn Theory of Reproducing Kernels. , 1950 .

[2]  F ROSENBLATT,et al.  The perceptron: a probabilistic model for information storage and organization in the brain. , 1958, Psychological review.

[3]  M. Aizerman,et al.  Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning , 1964 .

[4]  L. Bregman The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming , 1967 .

[5]  Manuel Blum,et al.  Time Bounds for Selection , 1973, J. Comput. Syst. Sci..

[6]  丸山 徹 Convex Analysisの二,三の進展について , 1977 .

[7]  Y. Censor,et al.  An iterative row-action method for interval convex programming , 1981 .

[8]  N. Littlestone Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm , 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).

[9]  J. Mycielski A learning theorem for linear operators , 1988 .

[10]  Charles L. Byrne,et al.  General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis , 1990, IEEE Trans. Inf. Theory.

[11]  Vladimir Vovk,et al.  Aggregating strategies , 1990, COLT '90.

[12]  N. Littlestone Mistake bounds and logarithmic linear-threshold learning algorithms , 1990 .

[13]  I. Csiszár Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems , 1991 .

[14]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[15]  Philip M. Long,et al.  WORST-CASE QUADRATIC LOSS BOUNDS FOR ON-LINE PREDICTION OF LINEAR FUNCTIONS BY GRADIENT DESCENT , 1993 .

[16]  Manfred K. Warmuth,et al.  The Weighted Majority Algorithm , 1994, Inf. Comput..

[17]  Peter Auer,et al.  Exponentially many local minima for single neurons , 1995, NIPS.

[18]  Manfred K. Warmuth,et al.  Additive versus exponentiated gradient updates for linear prediction , 1995, STOC '95.

[19]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[20]  Vladimir Vovk,et al.  A game of prediction with expert advice , 1995, COLT '95.

[21]  Manfred K. Warmuth,et al.  The perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant , 1995, COLT '95.

[22]  Philip M. Long,et al.  Worst-case quadratic loss bounds for prediction using linear functions and gradient descent , 1996, IEEE Trans. Neural Networks.

[23]  T. Cover Universal Portfolios , 1996 .

[24]  Yoram Singer,et al.  On‐Line Portfolio Selection Using Multiplicative Updates , 1998, ICML.

[25]  Vladimir Vovk,et al.  Derandomizing Stochastic Prediction Strategies , 1997, COLT '97.

[26]  Manfred K. Warmuth,et al.  The Perceptron Algorithm Versus Winnow: Linear Versus Logarithmic Mistake Bounds when Few Input Variables are Relevant (Technical Note) , 1997, Artif. Intell..

[27]  Dale Schuurmans,et al.  General Convergence Results for Linear Discriminant Updates , 1997, COLT '97.

[28]  Manfred K. Warmuth,et al.  Exponentiated Gradient Versus Gradient Descent for Linear Predictors , 1997, Inf. Comput..

[29]  Heinz H. Bauschke,et al.  Legendre functions and the method of random Bregman projections , 1997 .

[30]  Yoram Singer,et al.  Switching Portfolios , 1998, Int. J. Neural Syst..

[31]  Tom Bylander,et al.  The binary exponentiated gradient algorithm for learning linear functions , 1997, COLT '97.

[32]  Avrim Blum,et al.  On-line Learning and the Metrical Task System Problem , 1997, COLT '97.

[33]  David Haussler,et al.  Sequential Prediction of Individual Sequences Under General Loss Functions , 1998, IEEE Trans. Inf. Theory.

[34]  Sally A. Goldman,et al.  Exploring applications of learning theory to pattern matching and dynamic adjustment of tcp acknowledgement delays , 1998 .

[35]  Mark Herbster,et al.  Tracking the best regressor , 1998, COLT' 98.

[36]  Claudio Gentile,et al.  The Robustness of the p-Norm Algorithms , 1999, COLT '99.

[37]  Adam Tauman Kalai,et al.  On-line algorithms for combining language models , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[38]  Manfred K. Warmuth,et al.  Relative loss bounds for single neurons , 1999, IEEE Trans. Neural Networks.

[39]  Darrell D. E. Long,et al.  Adaptive disk spin‐down for mobile computers , 2000, Mob. Networks Appl..