A global optimization technique for statistical classifier design

A global optimization method is introduced that minimizes the rate of misclassification. We first derive the theoretical basis for the method, then use it to develop a novel design algorithm, and demonstrate the algorithm's effectiveness and superior performance in the design of practical classifiers for some of the most popular structures currently in use. The method, grounded in ideas from statistical physics and information theory, extends the deterministic annealing approach to optimization in two ways: it incorporates structural constraints on data assignments to classes, and it takes the probability of error as the cost objective to be minimized. During the design, data are assigned to classes in probability so as to minimize the expected classification error given a specified level of randomness, as measured by Shannon's entropy. The constrained optimization is equivalent to a free-energy minimization, motivating a deterministic annealing approach in which the entropy and expected misclassification cost are reduced with the temperature while enforcing the classifier's structure. In the limit, a hard classifier is obtained. This approach is applicable to a variety of classifier structures, including the widely used prototype-based, radial basis function, and multilayer perceptron classifiers. The method is compared with learning vector quantization, backpropagation (BP), and several radial basis function design techniques, as well as with paradigms for more directly optimizing all these structures to minimize probability of error. The annealing method achieves significant performance gains over other design methods on a number of benchmark examples from the literature, while often retaining design complexity comparable with or only moderately greater than that of strict descent methods. Substantial gains, both inside and outside the training set, are achieved for complicated examples involving high-dimensional data and large class overlap.
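The probabilistic assignment idea can be illustrated with a minimal sketch. It assumes a nearest-prototype structure and a Gibbs (softmax) assignment over squared distances with a scale parameter `gamma`; these choices, the toy data, and all function names are illustrative assumptions, not the design algorithm itself. The quantity `free_energy` uses the generic form F = E - T H (expected error minus temperature times entropy).

```python
import math

def gibbs_assignments(x, prototypes, gamma):
    """Soft assignment of x to prototypes: P[j|x] ~ exp(-gamma * ||x - s_j||^2).

    The Gibbs form and the parameter name `gamma` are assumptions of this sketch.
    """
    d = [sum((xi - si) ** 2 for xi, si in zip(x, s)) for s in prototypes]
    d_min = min(d)  # subtract the minimum distance for numerical stability
    w = [math.exp(-gamma * (dj - d_min)) for dj in d]
    z = sum(w)
    return [wj / z for wj in w]

def entropy(p):
    """Shannon entropy (in nats) of an assignment distribution."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0.0)

def expected_error(p, proto_labels, true_label):
    """Probability mass assigned to prototypes of the wrong class."""
    return sum(pj for pj, c in zip(p, proto_labels) if c != true_label)

def free_energy(p, proto_labels, true_label, temperature):
    """Generic free energy F = E - T * H for a single sample."""
    return expected_error(p, proto_labels, true_label) - temperature * entropy(p)

# Toy setup (hypothetical): two prototypes, one per class, and one labeled sample.
prototypes = [(0.0, 0.0), (1.0, 0.0)]
proto_labels = [0, 1]
x, y = (0.4, 0.0), 0

# Larger gamma means less randomness: assignments harden toward the nearest prototype.
for gamma in (0.0, 1.0, 10.0, 100.0):
    p = gibbs_assignments(x, prototypes, gamma)
    print(gamma, expected_error(p, proto_labels, y), entropy(p))
```

At `gamma = 0` the sample is split evenly between the prototypes (maximum entropy); as `gamma` grows, both the entropy and the expected error of this correctly placed sample shrink toward zero, mirroring the hard-classifier limit described above.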
