Analysis of two gradient-based algorithms for on-line regression

In this paper we present a new analysis of two algorithms, Gradient Descent and Exponentiated Gradient, for solving regression problems in the on-line framework. Both algorithms compute a prediction that depends linearly on the current instance, and then update the coefficients of this linear combination according to the gradient of the loss function. However, the two algorithms use the gradient information in distinct ways to update the coefficients. For each algorithm, we show general regression bounds for any convex loss function. Furthermore, we show specialized bounds for the absolute and the square loss functions, thus extending previous results by Kivinen and Warmuth. In the nonlinear regression case, we show general bounds for pairs of transfer and loss functions satisfying a certain condition. We apply this result to the Hellinger loss and the entropic loss in the case of logistic regression (similar results, but only for the entropic loss, were also obtained by Helmbold et al. using a different analysis). Finally, we describe the connection between our approach and a general family of gradient-based algorithms.
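To make the distinction between the two updates concrete, the following is a minimal sketch (not taken from the paper itself) of one on-line round under the square loss: Gradient Descent moves the weights additively along the gradient of the loss, while Exponentiated Gradient scales each weight multiplicatively by an exponential of the corresponding gradient component and renormalizes. The learning rate `eta`, the choice of square loss, and the toy data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def gd_update(w, x, y, eta=0.1):
    """One Gradient Descent (GD) step for on-line linear regression.

    The prediction is the linear combination w . x; the weights are
    moved additively against the gradient of the square loss.
    """
    y_hat = np.dot(w, x)
    grad = 2.0 * (y_hat - y)      # derivative of (y_hat - y)^2 w.r.t. y_hat
    return w - eta * grad * x     # additive update

def eg_update(w, x, y, eta=0.1):
    """One Exponentiated Gradient (EG) step.

    The same gradient information is used multiplicatively: each weight
    is scaled by exp(-eta * gradient component), then the weight vector
    is renormalized so that it stays on the probability simplex.
    """
    y_hat = np.dot(w, x)
    grad = 2.0 * (y_hat - y)
    w_new = w * np.exp(-eta * grad * x)
    return w_new / w_new.sum()    # multiplicative update + normalization

# Toy usage on a single (instance, outcome) pair with illustrative data.
rng = np.random.default_rng(0)
x_t, y_t = rng.random(4), 0.3
w_gd = gd_update(np.zeros(4), x_t, y_t)        # unconstrained weights
w_eg = eg_update(np.full(4, 0.25), x_t, y_t)   # weights stay on the simplex
```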

[1] W. Hoeffding, Probability Inequalities for Sums of Bounded Random Variables, 1963.

[2] D. Pollard, Convergence of Stochastic Processes, 1984.

[3] N. Littlestone, Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm, 1987, 28th Annual Symposium on Foundations of Computer Science (SFCS 1987).

[4] Nick Littlestone, et al., From On-line to Batch Learning, 1989, COLT '89.

[5] Manfred K. Warmuth, et al., The Weighted Majority Algorithm, 1989, 30th Annual Symposium on Foundations of Computer Science.

[6] Vladimir Vovk, et al., Aggregating Strategies, 1990, COLT '90.

[7] Thomas M. Cover, et al., Elements of Information Theory, 2005.

[8] Philip M. Long, et al., On-line Learning of Linear Functions, 1991, STOC '91.

[9] Neri Merhav, et al., Universal Prediction of Individual Sequences, 1992, IEEE Trans. Inf. Theory.

[10] David Haussler, et al., Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications, 1992, Inf. Comput.

[11] David Haussler, et al., How to Use Expert Advice, 1993, STOC.

[12] Manfred K. Warmuth, et al., Using Experts for Predicting Continuous Outcomes, 1994, European Conference on Computational Learning Theory.

[13] Vladimir Vovk, et al., A Game of Prediction with Expert Advice, 1995, COLT '95.

[14] László Györfi, et al., A Probabilistic Theory of Pattern Recognition, 1996, Stochastic Modelling and Applied Probability.

[15] Philip M. Long, et al., Worst-Case Quadratic Loss Bounds for Prediction Using Linear Functions and Gradient Descent, 1996, IEEE Trans. Neural Networks.

[16] Yoav Freund, et al., Predicting a Binary Sequence Almost as Well as the Optimal Biased Coin, 1996, COLT '96.

[17] Manfred K. Warmuth, et al., How to Use Expert Advice, 1997, JACM.

[18] Vladimir Vovk, et al., Competitive On-line Linear Regression, 1997, NIPS.

[19] Dale Schuurmans, et al., General Convergence Results for Linear Discriminant Updates, 1997, COLT '97.

[20] Manfred K. Warmuth, et al., Exponentiated Gradient Versus Gradient Descent for Linear Predictors, 1997, Inf. Comput.

[21] Tom Bylander, et al., Worst-Case Absolute Loss Bounds for Linear Learning Algorithms, 1997, AAAI/IAAI.

[22] Kenji Yamanishi, et al., A Decision-Theoretic Extension of Stochastic Complexity and Its Applications to Learning, 1998, IEEE Trans. Inf. Theory.