Exponentiated Gradient Versus Gradient Descent for Linear Predictors

We consider two algorithms for on-line prediction based on a linear model. The algorithms are the well-known Gradient Descent (GD) algorithm and a new algorithm, which we call EG(+/-). Both maintain a weight vector using simple updates. For the GD algorithm, the update subtracts the gradient of the squared error made on a prediction. The EG(+/-) algorithm uses the components of the gradient in the exponents of factors by which the weight vector is updated multiplicatively. We present worst-case loss bounds for EG(+/-) and compare them to previously known bounds for the GD algorithm. The bounds suggest that the losses of the algorithms are in general incomparable, but that EG(+/-) has a much smaller loss when only a few components of the input are relevant for the predictions. We have performed experiments which show that our worst-case upper bounds are already quite tight on simple artificial data.
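To make the two update rules concrete, here is a minimal sketch in Python of one on-line step of each, following the description above: GD subtracts the gradient of the squared error, while EG(+/-) maintains a pair of positive weight vectors and multiplies each weight by an exponential factor whose exponent involves the corresponding gradient component. The learning rate eta and the total-weight parameter U below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def gd_update(w, x, y, eta=0.1):
    """One Gradient Descent (Widrow-Hoff) step:
    subtract the gradient of the squared prediction error."""
    y_hat = w @ x
    return w - eta * (y_hat - y) * x

def eg_update(w_pos, w_neg, x, y, eta=0.1, U=1.0):
    """One EG(+/-) step (sketch): the prediction uses the difference of two
    non-negative weight vectors; each weight is multiplied by an exponential
    factor built from the gradient, then renormalized so the total weight
    stays at U (an assumed scaling parameter)."""
    y_hat = (w_pos - w_neg) @ x
    grad = (y_hat - y) * x                  # gradient of the squared error w.r.t. the weights
    w_pos = w_pos * np.exp(-eta * grad)     # multiplicative update, exponent from the gradient
    w_neg = w_neg * np.exp(eta * grad)
    Z = (w_pos.sum() + w_neg.sum()) / U     # normalization keeps the total weight fixed
    return w_pos / Z, w_neg / Z
```

The multiplicative, normalized form of the EG(+/-) update tends to concentrate weight on a few components, which is consistent with the claim that its loss bound is much smaller when only a few input components are relevant.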
