Cost Complexity-Based Pruning of Ensemble Classifiers

Abstract. In this paper we study methods that combine multiple classification models learned over separate data sets. Numerous studies posit that such approaches provide the means to efficiently scale learning to large data sets, while also boosting the accuracy of the individual classifiers. These gains, however, come at the expense of increased demand for run-time system resources: the final ensemble meta-classifier may consist of a large collection of base classifiers that require more memory and reduce classification throughput. Here, we describe an algorithm for pruning the ensemble meta-classifier (i.e., discarding a subset of the available base classifiers) to reduce its size while preserving its accuracy, and we present a technique for measuring the trade-off between predictive performance and available run-time system resources. The algorithm is independent of the method used to compute the meta-classifier in the first place; it is based on decision tree pruning methods and relies on mapping an arbitrary ensemble meta-classifier to a decision tree model. Through an extensive empirical study of meta-classifiers computed over two real data sets, we show that our pruning algorithm is a robust and competitive approach to discarding classification models without degrading the overall predictive performance of the smaller ensemble formed from the classifiers that remain after pruning.
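
The abstract does not spell out the algorithm, so the sketch below only illustrates the general idea under assumptions of ours: base classifiers trained on disjoint partitions are combined by a decision tree trained on their predictions, cost-complexity pruning (scikit-learn's ccp_alpha parameter) shrinks that tree, and any base classifier whose prediction column the pruned tree never consults is discarded. The data set, base learners, and alpha value are illustrative choices, not the paper's.

```python
# Minimal sketch of cost-complexity-based ensemble pruning (assumptions noted above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the "separate data sets": three disjoint partitions.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
parts = np.array_split(np.arange(len(X_train)), 3)

# One base classifier per partition (learner choices are illustrative only).
bases = [LogisticRegression(max_iter=1000), GaussianNB(), KNeighborsClassifier()]
bases = [clf.fit(X_train[idx], y_train[idx]) for clf, idx in zip(bases, parts)]

# Meta-level data: column j holds base classifier j's predictions on a held-out set.
meta_X = np.column_stack([clf.predict(X_val) for clf in bases])

# Map the ensemble to a decision tree over the base predictions, then apply
# cost-complexity pruning; a larger ccp_alpha yields a smaller tree.
meta_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(meta_X, y_val)

# A base classifier whose prediction column never appears as a split in the
# pruned tree is never consulted at classification time, so it can be discarded.
used = set(meta_tree.tree_.feature[meta_tree.tree_.feature >= 0])
pruned_ensemble = [clf for j, clf in enumerate(bases) if j in used]
print(f"kept {len(pruned_ensemble)} of {len(bases)} base classifiers")
```

Raising ccp_alpha trades a little predictive performance for a smaller ensemble, which is the kind of accuracy-versus-resources trade-off the paper sets out to measure.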
