Cost functions to estimate a posteriori probabilities in multiclass problems

This paper addresses the problem of designing cost functions to estimate a posteriori probabilities in multiclass problems. We establish necessary and sufficient conditions that these costs must satisfy in one-class one-output networks whose outputs are consistent with probability laws. We then focus on a particular subset of these cost functions: those satisfying two frequently desirable properties, symmetry and separability (well-known cost functions, such as the quadratic cost and the cross-entropy, are particular cases within this subset). Finally, we present a universal stochastic gradient learning rule for single-layer networks, in the sense that it minimizes a general version of these cost functions for a wide family of nonlinear activation functions.
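
As a concrete illustration of one particular case in this family, the sketch below trains a single-layer network with softmax outputs by stochastic gradient descent on the cross-entropy cost, one of the cases the abstract names. This is a minimal hypothetical sketch, not the paper's general learning rule; the function names and learning rate are illustrative assumptions. Cross-entropy is a proper cost in the sense studied here: its expected value is minimized exactly when each output equals the corresponding a posteriori class probability.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: nonnegative outputs summing to 1,
    # as required for posterior-probability estimates.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sgd_step(W, b, x, d, lr=0.1):
    """One stochastic gradient step on the cross-entropy cost.

    W : weights, shape (n_classes, n_features)
    b : biases, shape (n_classes,)
    x : feature vector, shape (n_features,)
    d : one-hot target, shape (n_classes,)
    For softmax outputs with cross-entropy, the gradient with respect
    to the pre-activations reduces to (y - d), giving the delta rule.
    """
    y = softmax(W @ x + b)        # network outputs: estimates of P(class | x)
    err = y - d                   # gradient w.r.t. pre-activations
    W -= lr * np.outer(err, x)
    b -= lr * err
    return W, b
```

In the infinite-sample limit, minimizing this cost drives the outputs toward the true posteriors P(class | x), which is the defining property the paper characterizes for the whole family of symmetric, separable cost functions.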
