Learning Additive Models Online with Fast Evaluating Kernels

We develop three new techniques that build on the recent advances in online learning with kernels. First, we show that an exponential speed-up in prediction time per trial is possible for algorithms such as the Kernel-Adatron, the Kernel-Perceptron, and ROMMA for specific additive models. Second, we show that the techniques of recent algorithms for online linear prediction that track a best predictor changing over time can be implemented for kernel-based learners at no additional asymptotic cost. Finally, we introduce a new online kernel-based learning algorithm for which we give worst-case loss bounds for the ε-insensitive square loss.
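
For background, here is a minimal sketch (not from the paper) of the standard Kernel-Perceptron in dual form: the hypothesis is a kernel expansion over the examples on which mistakes were made, so each prediction costs one kernel evaluation per stored mistake. It is exactly this per-trial evaluation cost that the paper's speed-up for additive models targets; the sketch below does not implement that speed-up, and the Gaussian kernel and toy data are illustrative assumptions.

```python
import numpy as np

def kernel_perceptron(stream, kernel):
    """Online kernel perceptron: predict, then update on mistakes.

    `stream` yields (x_t, y_t) pairs with y_t in {-1, +1}; `kernel` is a
    Mercer kernel k(x, x'). Each prediction costs O(m) kernel evaluations,
    where m is the number of mistakes made so far; this is the per-trial
    cost that fast-evaluating additive kernels aim to reduce.
    """
    support, signs = [], []           # mistaken examples and their labels
    for x, y in stream:
        score = sum(s * kernel(z, x) for z, s in zip(support, signs))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y:                # mistake-driven dual update
            support.append(x)
            signs.append(y)
        yield y_hat

def rbf(x, z, gamma=1.0):
    """Gaussian kernel; an illustrative choice, not one of the additive
    kernels treated in the paper."""
    x, z = np.asarray(x, float), np.asarray(z, float)
    return float(np.exp(-gamma * np.dot(x - z, x - z)))

# Toy usage: run the learner over a small stream of labelled points in R^2.
data = [([0.0, 1.0], 1), ([1.0, 0.0], -1), ([0.1, 0.9], 1)]
predictions = list(kernel_perceptron(iter(data), rbf))
```

The dual (kernelized) form is what makes the algorithm compatible with any Mercer kernel, but it is also why naive prediction time grows with the number of mistakes, motivating fast-evaluation schemes for structured kernels.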
