Linear and Order Statistics Combiners for Pattern Classification

Several researchers have shown experimentally that substantial improvements can be obtained on difficult pattern recognition problems by combining or integrating the outputs of multiple classifiers. This chapter provides an analytical framework to quantify the improvements in classification results due to combining. The results apply to both linear combiners and order statistics combiners. We first show that, to a first-order approximation, the error rate obtained over and above the Bayes error rate is directly proportional to the variance of the actual decision boundaries around the Bayes optimum boundary. Combining classifiers in output space reduces this variance, and hence reduces the 'added' error. If N unbiased classifiers are combined by simple averaging, the added error rate can be reduced by a factor of N, provided the individual errors in approximating the decision boundaries are uncorrelated. Expressions are then derived for linear combiners whose constituent classifiers are biased or correlated, and the effect of output correlations on ensemble performance is quantified. For order statistics based non-linear combiners, we derive expressions that indicate how much the median, the maximum, and in general the i-th order statistic can improve classifier performance. The analysis presented here facilitates the understanding of the relationships among error rates, classifier boundary distributions, and combining in output space. Experimental results on several public-domain data sets are provided to illustrate the benefits of combining and to support the analytical results.

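As a minimal sketch of the two combiner families analyzed in the chapter, the Python snippet below averages the class-posterior outputs of N classifiers (the linear combiner) and, as an alternative, combines them through the i-th order statistic taken per class, which recovers the median and maximum combiners as special cases. The toy output matrix and the helper names (`linear_combiner`, `order_statistic_combiner`) are illustrative assumptions, not the chapter's experimental code; the final lines give a quick Monte Carlo check of the factor-of-N variance reduction claimed for unbiased, uncorrelated errors.

```python
import numpy as np

def linear_combiner(outputs):
    """Linear combiner: average the class-posterior estimates of N classifiers.

    outputs: array of shape (N, n_classes), one row per classifier.
    Returns the index of the class with the highest averaged output.
    """
    return int(np.argmax(outputs.mean(axis=0)))

def order_statistic_combiner(outputs, i):
    """Order statistics combiner: for each class, take the i-th order
    statistic (i = 0 is the minimum, i = N - 1 the maximum, i = N // 2
    the median) of the N classifier outputs, then pick the class with
    the largest combined score.
    """
    sorted_outputs = np.sort(outputs, axis=0)  # sort each class column across classifiers
    return int(np.argmax(sorted_outputs[i]))

# Hypothetical outputs of N = 3 classifiers on a 2-class problem;
# the third classifier is a noisy dissenter.
outputs = np.array([[0.60, 0.40],
                    [0.70, 0.30],
                    [0.25, 0.75]])
print(linear_combiner(outputs))               # average combiner  -> class 0
print(order_statistic_combiner(outputs, 1))   # median combiner   -> class 0
print(order_statistic_combiner(outputs, 2))   # max combiner      -> class 1 (swayed by the outlier)

# Monte Carlo check of the 1/N variance reduction for unbiased,
# uncorrelated boundary errors: averaging 25 i.i.d. errors should
# shrink the variance from ~1.0 to ~1/25.
rng = np.random.default_rng(0)
errors = rng.normal(0.0, 1.0, size=(10_000, 25))
print(errors[:, 0].var())         # single classifier: ~1.0
print(errors.mean(axis=1).var())  # average of 25:     ~0.04
```

Note how the max combiner is the most sensitive to a single outlying classifier, which is consistent with the chapter's motivation for studying the median and other intermediate order statistics.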