Differential learning leads to efficient neural network classifiers

The authors outline a differential theory of learning for statistical pattern classification. The theory is based on classification figure-of-merit (CFM) objective functions, described by J. P. Hampshire II and A. H. Waibel (IEEE Trans. Neural Netw., vol. 1, no. 2, pp. 216-218, June 1990). They prove that differential learning is efficient, requiring the least classifier complexity and the smallest training sample size necessary to achieve Bayesian (i.e., minimum-error) discrimination. A practical application of the theory is included, in which a simple differentially trained linear neural network classifier discriminates handwritten digits of the AT&T DB1 database with a 1.3% error rate. This error rate is less than one half of the best previous result for a linear classifier on this optical character recognition (OCR) task.
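
To make the training criterion concrete, the sketch below shows gradient ascent on a sigmoidal CFM of output differences for a linear classifier. It is a minimal illustration under stated assumptions, not the authors' exact formulation or code: the particular sigmoid form, the beta and learning-rate values, and the names cfm_objective and cfm_gradient_step are all illustrative.

```python
import numpy as np

def cfm_objective(W, b, x, y, beta=4.0):
    """CFM for one sample: mean sigmoid of (correct output - competing output).

    Assumed sigmoidal CFM form, in the spirit of Hampshire and Waibel (1990);
    higher values mean the correct class is more clearly ranked first.
    """
    z = W @ x + b                        # linear discriminant outputs
    d = z[y] - np.delete(z, y)           # differences vs. every competing class
    return float(np.mean(1.0 / (1.0 + np.exp(-beta * d))))

def cfm_gradient_step(W, b, x, y, lr=0.1, beta=4.0):
    """One ascent step on the CFM objective (we maximize, so add the gradient)."""
    z = W @ x + b
    others = [j for j in range(len(z)) if j != y]
    dW, db = np.zeros_like(W), np.zeros_like(b)
    for j in others:
        s = 1.0 / (1.0 + np.exp(-beta * (z[y] - z[j])))   # sigmoid of the difference
        g = beta * s * (1.0 - s) / len(others)            # d(CFM)/d(z[y] - z[j])
        dW[y] += g * x;  db[y] += g                       # push correct output up
        dW[j] -= g * x;  db[j] -= g                       # push competitor down
    return W + lr * dW, b + lr * db

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(10, 64))   # e.g. 10 digit classes, 64-dim inputs
    b = np.zeros(10)
    x, y = rng.normal(size=64), 3               # one synthetic sample
    for _ in range(100):
        W, b = cfm_gradient_step(W, b, x, y)
    print(cfm_objective(W, b, x, y))            # rises toward 1.0 as ranking improves
```

Because the objective depends only on differences between the correct output and its competitors, it rewards correct ranking of the outputs rather than accurate posterior estimates, which is the sense in which differential learning differs from error-measure (e.g., mean-squared-error) training.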

[1] Solomon Kullback, et al. Information Theory and Statistics, 1960.

[2] Etienne Barnard, et al. A comparison between criterion functions for linear classifiers, with an application to neural nets, 1989, IEEE Trans. Syst. Man Cybern.

[3] Bernhard E. Boser, et al. A training algorithm for optimal margin classifiers, 1992, COLT '92.

[4] H. Gish. A minimum classification error, maximum likelihood, neural network, 1992, Proceedings of ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5] B. V. K. Vijaya Kumar, et al. Why error measures are sub-optimal for training neural network pattern classifiers, 1992, Proceedings of the 1992 IJCNN International Joint Conference on Neural Networks.

[6] Barak A. Pearlmutter, et al. Equivalence Proofs for Multi-Layer Perceptron Classifiers and the Bayesian Discriminant Function, 1991.

[7] Biing-Hwang Juang, et al. Discriminative learning for minimum error classification [pattern recognition], 1992, IEEE Trans. Signal Process.

[8] Richard O. Duda, et al. Pattern classification and scene analysis, 1974, A Wiley-Interscience publication.

[9] Leslie G. Valiant, et al. A theory of the learnable, 1984, STOC '84.

[10] Elie Bienenstock, et al. Neural Networks and the Bias/Variance Dilemma, 1992, Neural Computation.

[11] A. Kolmogorov. Three approaches to the quantitative definition of information, 1968.

[12] Isabelle Guyon, et al. Structural Risk Minimization for Character Recognition, 1991, NIPS.

[13] Vladimir Vapnik, et al. Principles of Risk Minimization for Learning Theory, 1991, NIPS.

[14] Amro El-Jaroudi, et al. A new error criterion for posterior probability estimation with neural nets, 1990, 1990 IJCNN International Joint Conference on Neural Networks.

[15] B. Natarajan. Machine Learning: A Theoretical Approach, 1992.

[16] Geoffrey E. Hinton. Connectionist Learning Procedures, 1990.

[17] Alexander H. Waibel, et al. A novel objective function for improved phoneme recognition using time delay neural networks, 1990, 1989 International Joint Conference on Neural Networks.

[18] B. V. K. Vijaya Kumar, et al. Shooting Craps in Search of an Optimal Strategy for Training Connectionist Pattern Classifiers, 1991, NIPS.