Competitive on-line learning with a convex loss function

We consider the problem of sequential decision making under uncertainty in which the loss caused by a decision depends on the binary observation that follows it. In competitive on-line learning, the goal is to design decision algorithms that are almost as good as the best decision rules in a wide benchmark class, without making any assumptions about how the observations are generated. However, standard algorithms in this area can only deal with finite-dimensional (often countable) benchmark classes. In this paper we give similar results for decision rules ranging over an arbitrary reproducing kernel Hilbert space. For example, it is shown that for a wide class of loss functions (including the standard square, absolute, and log loss functions) the average loss of the master algorithm, over the first N observations, does not exceed the average loss of the best decision rule with a bounded norm plus O(N^{-1/2}). Our proof technique is very different from the standard ones and is based on recent results about defensive forecasting. Given the probabilities produced by a defensive forecasting algorithm, which are known to be well calibrated and to have good resolution in the long run, we use the expected loss minimization principle to find a suitable decision.
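
As a rough illustration of the expected-loss-minimization step described above (not the paper's actual master algorithm), the following Python sketch picks the decision that minimizes expected loss under a forecast probability p of the binary outcome. The grid of candidate decisions and the square-loss example are assumptions made here for concreteness.

```python
import numpy as np

def expected_loss_decision(p, loss, decisions):
    """Pick the decision minimizing expected loss under forecast p.

    p         : forecast probability that the binary outcome is 1
    loss      : loss(decision, outcome), with outcome in {0, 1}
    decisions : iterable of candidate decisions
    """
    # Expected loss of a candidate d is p*loss(d, 1) + (1-p)*loss(d, 0).
    return min(decisions,
               key=lambda d: p * loss(d, 1) + (1 - p) * loss(d, 0))

# Example with square loss: the expected loss p*(d-1)^2 + (1-p)*d^2
# is minimized at d = p, so the chosen decision should be close to p.
square_loss = lambda d, y: (d - y) ** 2
grid = np.linspace(0.0, 1.0, 101)  # candidate decisions in [0, 1]
print(expected_loss_decision(0.3, square_loss, grid))  # -> 0.3
```

With a well-calibrated forecast p, this pointwise minimization is exactly what makes the resulting decisions competitive: the forecast behaves like the true conditional probability in the long run, so minimizing expected loss against it is asymptotically as good as knowing the environment.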
