On-Line Regression Competitive with Reproducing Kernel Hilbert Spaces

We consider the problem of on-line prediction of real-valued labels, assumed bounded in absolute value by a known constant, of new objects from known labeled objects. The prediction algorithm’s performance is measured by the squared deviation of the predictions from the actual labels. No stochastic assumptions are made about the way the labels and objects are generated. Instead, we are given a benchmark class of prediction rules, some of which are hoped to produce good predictions. We show that for a wide range of infinite-dimensional benchmark classes one can construct a prediction algorithm whose cumulative loss over the first N examples does not exceed the cumulative loss of any prediction rule in the class plus $O(\sqrt{N})$; the main differences from known results are that we do not impose any upper bound on the norm of the considered prediction rules and that we achieve an optimal leading term in the excess loss of our algorithm. If the benchmark class is “universal” (dense in the class of continuous functions on each compact set), this provides an on-line non-stochastic analogue of universally consistent prediction in non-parametric statistics. We use two proof techniques: one is based on the Aggregating Algorithm and the other on the recently developed method of defensive forecasting.
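To fix notation, the guarantee described above can be written schematically as follows. This display only sketches the shape of the bound; the exact constant and its precise dependence on the norm $\|f\|_{\mathcal{F}}$ of the comparison rule are as worked out in the paper:

\[
\sum_{n=1}^{N} \bigl(\hat y_n - y_n\bigr)^2 \;\le\; \sum_{n=1}^{N} \bigl(f(x_n) - y_n\bigr)^2 \;+\; c\,\bigl(\|f\|_{\mathcal{F}} + 1\bigr)\sqrt{N} \qquad \text{for every } f \in \mathcal{F},
\]

where $\mathcal{F}$ is the benchmark reproducing kernel Hilbert space, $\hat y_n$ are the algorithm’s predictions, the labels satisfy $|y_n| \le Y$ for a known constant $Y$, and $c$ may depend on the kernel and on $Y$ but not on $N$ or $f$.

The following minimal sketch illustrates the on-line protocol itself: predict each label before it is revealed, clip the prediction to the known label range $[-Y, Y]$, and accumulate square loss. It uses plain clipped on-line kernel ridge regression, which is not the paper’s Aggregating Algorithm or defensive-forecasting method; the Gaussian kernel, the ridge parameter `a`, and the toy data stream are illustrative assumptions only.

```python
# Sketch of the on-line regression protocol from the abstract, using
# clipped kernel ridge regression as a stand-in learner. Assumptions
# (not from the paper): Gaussian kernel, ridge parameter a, toy data.
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    # Gaussian (RBF) kernel; its RKHS is universal on compact sets.
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def online_kernel_regression(stream, Y=1.0, a=1.0, kernel=gaussian_kernel):
    """Predict each label before seeing it; return cumulative square loss.

    `stream` yields (x, y) pairs with |y| <= Y. The kernel matrix is
    rebuilt each round for clarity; a serious implementation would
    update it (and the solve) incrementally.
    """
    xs, ys, loss = [], [], 0.0
    for x, y in stream:
        if xs:
            K = np.array([[kernel(u, v) for v in xs] for u in xs])
            k = np.array([kernel(x, u) for u in xs])
            # Kernel ridge prediction from past data, clipped to [-Y, Y]:
            # clipping never hurts square loss since the true label lies there.
            pred = k @ np.linalg.solve(K + a * np.eye(len(xs)), np.array(ys))
            pred = float(np.clip(pred, -Y, Y))
        else:
            pred = 0.0  # no data yet; any value in [-Y, Y] is admissible
        loss += (pred - y) ** 2
        xs.append(x)
        ys.append(y)
    return loss

# Hypothetical usage on a toy stream with labels in [-1, 1]:
rng = np.random.default_rng(0)
points = [rng.uniform(-1, 1, size=2) for _ in range(50)]
data = [(x, float(np.clip(np.sin(3 * x[0]) + 0.1 * rng.standard_normal(), -1, 1)))
        for x in points]
print(online_kernel_regression(iter(data)))
```

The clipping step is what makes the comparison in square loss meaningful under the known label bound $Y$. Note that this sketch does not by itself come with the paper’s $O(\sqrt{N})$ excess-loss guarantee over all of $\mathcal{F}$ without a norm bound; that is precisely what the Aggregating Algorithm and defensive-forecasting constructions in the paper supply.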
