Evolution and generalization of a single neurone: I. Single-layer perceptron as seven statistical classifiers

Unlike many other investigations of this topic, the present one treats the non-linear single-layer perceptron (SLP) as a process in which the perceptron's weights grow and the sum-of-squares cost function changes gradually. During backpropagation training, the decision boundary of the SLP becomes identical or close to that of seven statistical classifiers: (1) the Euclidean distance classifier, (2) regularized linear discriminant analysis, (3) the standard Fisher linear discriminant function, (4) the Fisher linear discriminant function with a pseudoinverse covariance matrix, (5) the generalized Fisher discriminant function, (6) the minimum empirical error classifier, and (7) the maximum margin classifier. To obtain a wider range of classifiers, five new complexity-control techniques are proposed: target value control, moving the centre of the learning data to the origin of coordinates, zero weight initialization, use of an additional negative weight-decay term called "anti-regularization", and use of an exponentially increasing learning step. Which particular classifier is obtained depends on the data, the cost function to be minimized, the optimization technique and its parameters, and the stopping criteria.
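The training regime described above can be sketched as plain gradient descent on a sum-of-squares cost with a sigmoid-type activation. The sketch below is illustrative only, not the paper's implementation: the function name `train_slp` and its parameters (`target` for target value control, a `decay` coefficient whose negative values act as "anti-regularization", and `eta_growth` for an exponentially increasing learning step) are hypothetical names chosen here to mirror the five complexity controls listed in the abstract.

```python
import numpy as np

def train_slp(X, y, epochs=100, eta0=0.01, eta_growth=1.0,
              decay=0.0, target=1.0):
    """Gradient descent for a single-layer perceptron with a tanh
    activation and a sum-of-squares cost (illustrative sketch).

    Hypothetical knobs mirroring the paper's complexity controls:
      target     -- magnitude of the training targets (target value control)
      decay      -- weight-decay coefficient; a negative value plays the
                    role of "anti-regularization"
      eta_growth -- per-epoch multiplier of the learning step; values > 1
                    give an exponentially increasing step
    """
    X = X - X.mean(axis=0)                 # move the data centre to the origin
    t = np.where(y > 0, target, -target)   # scaled targets
    w = np.zeros(X.shape[1])               # zero weight initialization
    b = 0.0
    eta = eta0
    for _ in range(epochs):
        a = np.tanh(X @ w + b)             # non-linear activation
        err = a - t                        # residual of the squared cost
        grad = err * (1.0 - a ** 2)        # chain rule through tanh
        w -= eta * (X.T @ grad / len(X) + decay * w)
        b -= eta * grad.mean()
        eta *= eta_growth                  # exponentially increasing step
    return w, b
```

With small initial weights the boundary starts near the Euclidean distance classifier; which of the seven classifiers the run ends near depends on how long training continues and on these control parameters.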
