A Second-Order Perceptron Algorithm

Kernel-based linear-threshold algorithms, such as support vector machines and Perceptron-like algorithms, are among the best available techniques for solving pattern classification problems. In this paper, we describe an extension of the classical Perceptron algorithm, called the second-order Perceptron, and analyze its performance within the mistake-bound model of on-line learning. The bound achieved by our algorithm depends on the algorithm's sensitivity to second-order information in the data and is the best known mistake bound for (efficient) kernel-based linear-threshold classifiers to date. This mistake bound, which strictly generalizes the well-known Perceptron bound, is expressed in terms of the eigenvalues of the empirical data correlation matrix and depends on a parameter controlling the sensitivity of the algorithm to the distribution of these eigenvalues. Since the optimal setting of this parameter is not known a priori, we also analyze two variants of the second-order Perceptron algorithm: one that adaptively sets the value of the parameter in terms of the number of mistakes made so far, and one that is parameterless, based on pseudoinverses.
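To make the update concrete, here is a minimal NumPy sketch of the primal form of the second-order Perceptron with a fixed parameter a, written for illustration only: the adaptive and pseudoinverse variants mentioned above, as well as the kernel (dual) form, are omitted, and the function and variable names (second_order_perceptron, stream, etc.) are our own.

```python
import numpy as np

def second_order_perceptron(stream, a=1.0):
    """Primal second-order Perceptron with a fixed trade-off parameter a > 0.

    `stream` yields pairs (x, y), where x is a d-dimensional ndarray and
    y is a label in {-1, +1}.  Returns the number of mistakes made.
    """
    mistakes, v, S = 0, None, None
    for x, y in stream:
        if v is None:                        # lazy initialization from the first instance
            d = x.shape[0]
            v = np.zeros(d)                  # v = sum of y_s * x_s over past mistake rounds
            S = np.empty((d, 0))             # columns = instances on which a mistake occurred
        # Correlation matrix of past mistaken instances, augmented with the
        # current instance and regularized ("warped") by a*I.
        M = a * np.eye(len(x)) + S @ S.T + np.outer(x, x)
        margin = v @ np.linalg.solve(M, x)   # predict with the warped weight vector
        y_hat = 1.0 if margin >= 0 else -1.0
        if y_hat != y:                       # second-order update only on mistake rounds
            mistakes += 1
            v = v + y * x
            S = np.hstack([S, x.reshape(-1, 1)])
    return mistakes

# Toy usage on a linearly separable stream.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_star = rng.normal(size=5)
labeled = ((x, 1.0 if x @ w_star >= 0 else -1.0) for x in X)
print(second_order_perceptron(labeled, a=1.0))
```

The sketch favors clarity over efficiency: maintaining the inverse of M incrementally (e.g., via rank-one updates) would avoid the cubic-time linear solve performed at every trial.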
