On the Probabilistic Interpretation of Neural Network Classifiers and Discriminative Training Criteria

A probabilistic interpretation is presented for two important issues in neural network based classification, namely the interpretation of discriminative training criteria and the neural network outputs as well as the interpretation of the structure of the neural network. The problem of finding a suitable structure of the neural network can be linked to a number of well established techniques in statistical pattern recognition. Discriminative training of neural network outputs amounts to approximating the class or posterior probabilities of the classical statistical approach. This paper extends these links by introducing and analyzing novel criteria such as maximizing the class probability and minimizing the smoothed error rate. These criteria are defined in the framework of class conditional probability density functions. We show that these criteria can be interpreted in terms of weighted maximum likelihood estimation. In particular, this approach covers widely used techniques such as corrective training, learning vector quantization, and linear discriminant analysis. >

[1]  Naftali Tishby,et al.  Consistent inference of probabilities in layered networks: predictions and generalizations , 1989, International 1989 Joint Conference on Neural Networks.

[2]  David J. Hand,et al.  Kernel Discriminant Analysis , 1983 .

[3]  H. Gish,et al.  A probabilistic approach to the understanding and training of neural network classifiers , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[4]  N. Otsu,et al.  Nonlinear data analysis and multilayer perceptrons , 1989, International 1989 Joint Conference on Neural Networks.

[5]  Halbert White,et al.  Learning in Artificial Neural Networks: A Statistical Perspective , 1989, Neural Computation.

[6]  James K. Baker,et al.  Stochastic modeling for automatic speech understanding , 1990 .

[7]  David Lowe,et al.  The optimised internal representation of multilayer classifier networks performs nonlinear discriminant analysis , 1990, Neural Networks.

[8]  Teuvo Kohonen,et al.  Self-Organization and Associative Memory , 1988 .

[9]  Harvey F. Silverman,et al.  Neural networks, maximum mutual information training, and maximum likelihood training (speech recognition) , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[10]  John Scott Bridle,et al.  Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition , 1989, NATO Neurocomputing.

[11]  Donald F. Specht,et al.  Probabilistic neural networks , 1990, Neural Networks.

[12]  F. Jelinek,et al.  Continuous speech recognition by statistical methods , 1976, Proceedings of the IEEE.

[13]  Richard Lippmann,et al.  Neural Network Classifiers Estimate Bayesian a posteriori Probabilities , 1991, Neural Computation.

[14]  Andrew J. Viterbi,et al.  Principles of Digital Communication and Coding , 1979 .

[15]  Esther Levin,et al.  Accelerated Learning in Layered Neural Networks , 1988, Complex Syst..

[16]  Alexander H. Waibel,et al.  A novel objective function for improved phoneme recognition using time delay neural networks , 1990, International 1989 Joint Conference on Neural Networks.

[17]  T. Kohonen,et al.  Statistical pattern recognition with neural networks: benchmarking studies , 1988, IEEE 1988 International Conference on Neural Networks.

[18]  Lalit R. Bahl,et al.  A new algorithm for the estimation of hidden Markov model parameters , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[19]  Baxter F. Womack,et al.  An Adaptive Pattern Classification System , 1966, IEEE Trans. Syst. Sci. Cybern..

[20]  Biing-Hwang Juang,et al.  Discriminative learning for minimum error classification [pattern recognition] , 1992, IEEE Trans. Signal Process..

[21]  Frederick Jelinek,et al.  Speech Recognition by Statistical Methods , 1976 .

[22]  Keinosuke Fukunaga,et al.  Introduction to Statistical Pattern Recognition , 1972 .

[23]  P. GALLINARI,et al.  On the relations between discriminant analysis and multilayer perceptrons , 1991, Neural Networks.

[24]  H. Bourlard,et al.  Links Between Markov Models and Multilayer Perceptrons , 1990, IEEE Trans. Pattern Anal. Mach. Intell..

[25]  M. Aizerman,et al.  Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning , 1964 .

[26]  Waibel A novel objective function for improved phoneme recognition using time delay neural networks , 1989 .

[27]  Colin C. Blaydon,et al.  On a pattern classification result of Aizerman, Braverman, and Rozonoer (Corresp.) , 1966, IEEE Trans. Inf. Theory.

[28]  Keinosuke Fukunaga,et al.  Introduction to statistical pattern recognition (2nd ed.) , 1990 .

[29]  David G. Lowe,et al.  Optimized Feature Extraction and the Bayes Decision in Feed-Forward Classifier Networks , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[30]  Lalit R. Bahl,et al.  Maximum mutual information estimation of hidden Markov model parameters for speech recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[31]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[32]  Alex Waibel,et al.  Phoneme recognition: neural networks vs. hidden Markov models vs. hidden Markov models , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[33]  Arthur Nadas,et al.  Optimal solution of a training problem in speech recognition , 1985, IEEE Trans. Acoust. Speech Signal Process..

[34]  Calyampudi Radhakrishna Rao,et al.  Linear Statistical Inference and its Applications , 1967 .

[35]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[36]  S. Renals,et al.  Phoneme classification experiments using radial basis functions , 1989, International 1989 Joint Conference on Neural Networks.

[37]  Yariv Ephraim,et al.  Estimation of hidden Markov model parameters by minimizing empirical error rate , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[38]  Kai-Fu Lee,et al.  Corrective and reinforcement learning for speaker-independent continuous speech recognition , 1989, EUROSPEECH.

[39]  F. Girosi,et al.  Networks for approximation and learning , 1990, Proc. IEEE.

[40]  Teuvo Kohonen,et al.  Self-organization and associative memory: 3rd edition , 1989 .

[41]  Amro El-Jaroudi,et al.  A new error criterion for posterior probability estimation with neural nets , 1990, 1990 IJCNN International Joint Conference on Neural Networks.