A Comparison of Pruning Criteria for Probability Trees

Probability trees are decision trees that predict class probabilities rather than the most likely class. The pruning criterion used to learn a probability tree strongly influences the size of the tree and thereby the quality of its probability estimates. While the effect of pruning criteria on classification accuracy is well studied, interest in their effect on probability estimates has emerged only recently. Hence, it is currently unclear which pruning criteria for probability trees are preferable under which circumstances. In this paper we survey six of the most important pruning criteria for probability trees and discuss their theoretical advantages and disadvantages. We also perform an extensive experimental study of the relative performance of these pruning criteria. The main conclusion is that, overall, a pruning criterion based on randomization tests performs best because it is the most robust to extreme data characteristics such as class skew or a high number of classes. We also identify and explain several shortcomings of the other pruning criteria.
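To make the randomization-test idea concrete, the following is a minimal sketch of how such a pruning decision can work: the observed split statistic is compared against its distribution under random permutations of the class labels, and the split is rejected (i.e., the node is pruned to a leaf) when it is not significantly better than chance. The choice of information gain as the statistic, the 0.05 significance level, and all function names below are illustrative assumptions, not the exact criterion evaluated in the paper.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    """Gain of splitting `labels` into left/right parts by a boolean mask."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

def permutation_p_value(labels, left_mask, n_permutations=1000, seed=None):
    """Fraction of label permutations whose gain matches or beats the
    observed gain; a small value means the split is unlikely to be chance."""
    rng = np.random.default_rng(seed)
    observed = information_gain(labels, left_mask)
    hits = sum(
        information_gain(rng.permutation(labels), left_mask) >= observed
        for _ in range(n_permutations)
    )
    return (hits + 1) / (n_permutations + 1)  # add-one smoothing avoids p == 0

def keep_split(labels, left_mask, alpha=0.05, **kwargs):
    """Accept the split only if significant at level `alpha`;
    otherwise the node would be pruned to a leaf."""
    return permutation_p_value(labels, left_mask, **kwargs) <= alpha

# Example: a split that perfectly separates two balanced classes is kept.
labels = np.array([0] * 5 + [1] * 5)
mask = np.array([True] * 5 + [False] * 5)
print(keep_split(labels, mask, seed=42))  # expected: True (significant split)
```

Because the p-value reflects how easily an equally good split arises on randomly shuffled labels, such a test naturally guards against splits that appear informative purely by chance, which is one intuition for the robustness reported above.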
