Links between perceptrons, MLPs and SVMs

We study the links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first examine ways to control the capacity of Perceptrons (mainly through regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that, under simple conditions, a Perceptron is equivalent to an SVM, we show that training an SVM (and thus a Perceptron) with stochastic gradient descent can be computationally expensive, mainly because of the margin-maximization term in the cost function. We then show that if this term is removed, the margin can still be controlled by the learning rate or by early stopping. These ideas are then extended to MLPs. Moreover, under some assumptions, MLPs appear to be a kind of mixture of SVMs, maximizing the margin in the hidden-layer space. Finally, we present a very simple MLP based on these findings, which yields better generalization and speed than the other models.
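To make the cost functions discussed above concrete, here is a minimal sketch (not the paper's code; the function name `sgd_hinge`, the toy data, and all hyperparameters are illustrative assumptions) of training a linear classifier by stochastic gradient descent on the hinge loss. With the L2 term (`lam > 0`) one recovers the primal SVM objective with its explicit margin-maximization term; with `lam = 0` the margin is instead controlled implicitly by the learning rate and by early stopping, as described in the abstract.

```python
import numpy as np

def sgd_hinge(X, y, lam=0.0, lr=0.01, epochs=100, rng=None):
    """Train a linear classifier f(x) = w.x + b by stochastic gradient
    descent on the hinge loss max(0, 1 - y f(x)) plus an optional
    L2 term (lam/2)||w||^2.  With lam > 0 this is the primal SVM
    objective; with lam == 0 it is a Perceptron-style model whose
    effective margin is controlled by lr and the number of epochs
    (early stopping) rather than by explicit regularization."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1.0:
                # Subgradient step on hinge loss (+ regularizer if lam > 0).
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            elif lam > 0.0:
                # Only the regularizer acts on well-classified examples.
                w -= lr * lam * w
    return w, b

# Toy usage on linearly separable data (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

w_svm, b_svm = sgd_hinge(X, y, lam=0.01)                # explicit margin maximization
w_perc, b_perc = sgd_hinge(X, y, lam=0.0, epochs=10)    # margin via learning rate / early stopping
```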
