Relative Loss Bounds for On-Line Density Estimation with the Exponential Family of Distributions

We consider on-line density estimation with a parameterized density from the exponential family. The on-line algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example, the algorithm incurs a loss, which is the negative log-likelihood of the example with respect to the current parameter of the algorithm. An off-line algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the on-line algorithm over the total loss of the best off-line parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use Bregman divergences to derive and analyze each algorithm; these divergences are relative entropies between two distributions from the exponential family. We also use our methods to prove relative loss bounds for linear regression.
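As a rough illustration of the protocol described above (not the paper's exact algorithm or analysis), the following sketch runs on-line density estimation for a unit-variance Gaussian, a simple member of the exponential family. The parameter is the running average of the past examples, the per-trial loss is the negative log-likelihood under the current parameter, and the total on-line loss is compared against the loss of the best fixed parameter in hindsight. The function name `online_vs_offline_gaussian` and the initial guess `x0` are assumptions introduced here for the example.

```python
import numpy as np


def online_vs_offline_gaussian(xs, x0=0.0):
    """Sketch of the on-line density estimation protocol for N(mu, 1).

    The on-line parameter mu is the running average of the examples seen so
    far, seeded with a hypothetical initial guess x0. The off-line comparator
    is the single best mean in hindsight, i.e. the average of all examples.
    """
    def nll(x, mu):
        # Negative log-likelihood of x under a unit-variance Gaussian N(mu, 1).
        return 0.5 * (x - mu) ** 2 + 0.5 * np.log(2 * np.pi)

    mu, count = x0, 1            # treat x0 as one pseudo-example so the average is defined
    online_loss = 0.0
    for x in xs:
        online_loss += nll(x, mu)    # incur the loss before seeing how to update...
        count += 1
        mu += (x - mu) / count       # ...then fold the example into the running average

    best_mu = np.mean(xs)                                 # best off-line parameter
    offline_loss = sum(nll(x, best_mu) for x in xs)
    return online_loss, offline_loss                      # difference = relative (regret) loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = rng.normal(loc=2.0, scale=1.0, size=1000)
    on, off = online_vs_offline_gaussian(xs)
    print(f"on-line loss {on:.1f}, off-line loss {off:.1f}, regret {on - off:.2f}")
```

For this Gaussian case the relative loss stays small (logarithmic in the sequence length for well-behaved data) even though the individual losses grow linearly; the paper's contribution is to prove such bounds for arbitrary example sequences across the exponential family, using Bregman divergences in the analysis.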
