Appears in ECML-98 as a research note

Pruning Decision Trees with Misclassification Costs

We describe an experimental study of pruning methods for decision tree classifiers when the goal is minimizing loss rather than error. In addition to two common methods for error minimization, CART's cost-complexity pruning and C4.5's error-based pruning, we study the extension of cost-complexity pruning to loss and one pruning variant based on the Laplace correction. We perform an empirical comparison of these methods and evaluate them with respect to loss. We found that applying the Laplace correction to estimate the probability distributions at the leaves was beneficial to all pruning methods. Unlike in error minimization, and somewhat surprisingly, performing no pruning led to results that were on par with the other methods in terms of the evaluation criteria. The main advantage of pruning was the reduction in decision tree size, sometimes by a factor of ten. No method dominated the others on all datasets, and even for the same domain, different pruning mechanisms are better for different loss matrices.

1 Pruning Decision Trees

Decision trees are a widely used symbolic modeling technique for classification tasks in machine learning. The most common approach to constructing decision tree classifiers is to grow a full tree and prune it back. Pruning is desirable because the tree that is grown may overfit the data by inferring more structure than is justified by the training set. Specifically, if there are no conflicting instances, the training set error of a fully built tree is zero, while the true error is likely to be larger. To combat this overfitting problem, the tree is pruned back with the goal of identifying the tree with the lowest error rate on previously unobserved instances, breaking ties in favor of smaller trees (Breiman, Friedman, Olshen & Stone 1984, Quinlan 1993). Several pruning methods have been introduced in the literature, including cost-complexity pruning, reduced error pruning, pessimistic pruning, error-based pruning, penalty pruning, and MDL pruning.

Historically, most pruning algorithms have been developed to minimize the expected error rate of the decision tree, assuming that classification errors have the same unit cost. Our objective in this paper is different from that of the above-mentioned studies. Instead of pruning to minimize error, we aim to study pruning algorithms with the goal of minimizing loss. In many practical applications one has a loss matrix associated with classification errors (Turney 1997), and pruning should be performed with respect to the loss matrix. Pruning for loss minimization can lead to different pruning behavior than pruning for error minimization does.

In this paper, we investigate the behavior of several pruning algorithms. In addition to the two most common methods for error minimization, cost-complexity pruning (Breiman et al. 1984) and error-based pruning (Quinlan 1993), we study the extension of cost-complexity pruning to loss and a pruning variant based on the Laplace correction (Good 1965, Cestnik 1990). We perform an empirical comparison of these methods and evaluate them with respect to loss under two different loss matrices. We found that even for the same domain, different pruning mechanisms are better for different loss matrices. In addition, we found that adjusting the probability distributions at the leaves using the Laplace correction was beneficial to all methods.
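To make the loss-matrix setting concrete, the following minimal sketch (our illustration, not taken from the paper; the class names and loss values are hypothetical) shows how the label that minimizes expected loss at a leaf can differ from the majority-class label that minimizes expected error.

```python
# Illustrative only: a hypothetical 2-class loss matrix, not taken from the paper.
# LOSS[(true_class, predicted_class)] is the cost of that prediction;
# correct predictions cost 0.
LOSS = {
    ("sick", "healthy"): 10.0,   # costly miss of the rare, important class
    ("healthy", "sick"): 1.0,    # cheaper false alarm
    ("sick", "sick"): 0.0,
    ("healthy", "healthy"): 0.0,
}

def expected_loss(prediction, class_probs):
    """Expected loss of predicting `prediction` at a leaf whose class
    distribution is estimated by `class_probs` (class -> probability)."""
    return sum(p * LOSS[(true_class, prediction)]
               for true_class, p in class_probs.items())

# A leaf where "healthy" is the majority class.
probs = {"healthy": 0.8, "sick": 0.2}

majority_label = max(probs, key=probs.get)                            # "healthy"
loss_minimizing = min(probs, key=lambda c: expected_loss(c, probs))   # "sick"

print(majority_label, loss_minimizing)
# The error-minimizing label is "healthy", but the loss-minimizing label is
# "sick", because 0.2 * 10 > 0.8 * 1.
```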
2 The Pruning Algorithms and Evaluation Criteria

Most pruning algorithms perform a post-order traversal of the tree, replacing a subtree by a single leaf node when the estimated error of the leaf replacing the subtree is lower than that of the subtree. The crux of the problem is to find an honest estimate of error (Breiman et al. 1984), defined as one that is not overly optimistic for a tree that was built to minimize errors in the first place. The resubstitution error (the error rate on the training set) does not provide a suitable estimate because a leaf node replacing a subtree will never have fewer errors on the training set than the subtree. Two commonly used pruning algorithms for error minimization are C4.5's error-based pruning (Quinlan 1993) and CART's cost-complexity pruning (Breiman et al. 1984).

We attempted to extend several error-based pruning methods to loss-based pruning. In some cases the extensions are obvious, but C4.5's error-based pruning, which is based on confidence intervals, does not extend easily. The naive idea of computing a confidence interval for each class probability and computing the losses based on the upper bound of the interval for each class yields a distribution that does not sum to one. Experiments we ran with some variants (e.g., normalizing the probabilities) did not perform well. Instead, we decided to use a Laplace-based pruning method.

The Laplace-based pruning method we introduce here has a motivation similar to that of C4.5's error-based pruning. The Laplace correction biases the probability estimates towards a uniform distribution. Specifically, if a node has m instances, c of which are from a given class, in a k-class problem, the probability assigned to the class is (c+1)/(m+k) (Good 1965, Cestnik 1990). The Laplace correction makes the distribution at the leaves more uniform and less extreme. Given a node, we can compute its expected loss using the loss matrix; the expected loss of a subtree is the sum of the expected losses of its leaves.

The cost-complexity pruning (CCP) algorithm used in CART penalizes the estimated error based on the subtree size. Specifically, the error estimate assigned to a subtree is the resubstitution error plus a factor α times the subtree size. An efficient search algorithm can be used to compute all the distinct values of α that change the tree size, and α is chosen to minimize the error on a holdout sample or using cross-validation. Once the optimal value of α is found, the entire training set is used to grow the tree, which is then pruned using this optimal value. In our experiments, we used the holdout method, holding back 20% of the training set to estimate the best α. Cost-complexity pruning extends naturally to loss matrices: instead of estimating the error of a subtree, we estimate its loss (or cost), using the resubstitution loss and penalizing by the size of the tree times the factor α, as in error-based CCP.

3 A Comparison of Pruning Algorithms

Our goal in designing these experiments was to understand which pruning methods work well when the decision tree classifier is evaluated on loss given a loss matrix. The basic decision tree growing algorithm, implemented in MLC++ (Kohavi, Sommerfield & Dougherty 1996), is called MC4 (MLC++ C4.5). It is a top-down decision tree induction algorithm very similar to C4.5. The algorithm grows the decision tree following the standard methodology of choosing the best attribute according to the gain-ratio evaluation criterion and stopping when a node has two or fewer instances.
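As a concrete reference for the pruning variants compared below, the following is a minimal sketch of the machinery described in Section 2: Laplace-corrected leaf probabilities and a post-order traversal that collapses a subtree into a leaf when the leaf's estimated loss is no worse than the subtree's. The tree representation, the dict-based loss matrix, the instance-count weighting, and all names are assumptions made for illustration; this is not the MLC++ implementation used in the experiments.

```python
# A minimal sketch, not the MLC++ implementation used in the paper.
# `Node`, the dict-based loss matrix, and the instance-count weighting
# are assumptions made for illustration.
from dataclasses import dataclass, field

@dataclass
class Node:
    class_counts: dict                      # class label -> training-instance count (include zero counts)
    children: list = field(default_factory=list)

def laplace_probs(counts, k):
    """Laplace-corrected estimates: P(class) = (c + 1) / (m + k)."""
    m = sum(counts.values())
    return {c: (n + 1) / (m + k) for c, n in counts.items()}

def leaf_loss(node, loss, k):
    """Estimated loss if `node` is treated as a leaf, weighted by its size.
    loss[(true, predicted)] is the cost of that prediction (0 on the diagonal)."""
    m = sum(node.class_counts.values())
    probs = laplace_probs(node.class_counts, k)
    # label the leaf with the class that minimizes expected loss
    best = min(probs, key=lambda pred: sum(p * loss[(true, pred)]
                                           for true, p in probs.items()))
    return m * sum(p * loss[(true, best)] for true, p in probs.items())

def subtree_loss(node, loss, k):
    """Loss of a subtree: the sum of the losses of its leaves."""
    if not node.children:
        return leaf_loss(node, loss, k)
    return sum(subtree_loss(child, loss, k) for child in node.children)

def prune(node, loss, k):
    """Post-order traversal: collapse a subtree whenever a single leaf
    is estimated to do no worse than the subtree it would replace."""
    for child in node.children:
        prune(child, loss, k)
    if node.children and leaf_loss(node, loss, k) <= subtree_loss(node, loss, k):
        node.children = []                  # replace the subtree by a leaf
    return node
```

Under the same assumptions, the cost-complexity variant would instead compare resubstitution losses penalized by α times the number of leaves, with α selected on a holdout sample as described in Section 2.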
The trees are pruned using the following pruning algorithms:

eb-fr   Error-based pruning (C4.5) with probabilities estimated using frequency counts.
eb-lc   Error-based pruning with probabilities estimated using the Laplace correction.
np-lc   No pruning, with probabilities estimated using the Laplace correction.
lp      Laplace-based pruning with probabilities estimated using the Laplace correction.
ccp-lc  Cost-complexity pruning based on loss, with probabilities estimated using the Laplace correction.

The leaves of the trees are labeled with the class that minimizes expected loss based on the probability estimates at each leaf. In our initial experiments, the Laplace correction outperformed frequency counts in all variants. Therefore, except for the basic error-based pruning method, all other pruning methods were run with the Laplace correction.

Ten datasets were chosen from the UCI repository (Merz & Murphy 1997): adult (salary classification based on census bureau data), breast cancer diagnosis, chess, crx (credit), german (credit), pima diabetes, road (dirt), satellite images, shuttle, and vehicle. In choosing the datasets, we decided on the following desiderata:

1. Datasets should be two-class to make the evaluation easier. This desideratum was hard to satisfy, and we resorted to converting several multi-class problems into two-class problems by choosing the least prevalent class as the goal class.
2. Datasets should not have too many unknowns. To avoid another factor in this evaluation, we removed all instances with unknown values from the files.
3. The standard error of the estimated loss should be small. This was very important because with loss matrices the standard deviations of the estimates can be large. We therefore decided to require at least 500 instances and to train on only 25% of the data, leaving the remaining instances for testing.

We wanted to test the following hypotheses:

1. The Laplace correction for estimating probabilities at the leaves leads to lower loss than frequency counts.
2. Considering the loss matrix during pruning leads to lower loss than pruning based on errors.

For all datasets we trained on 25% of the data and tested on 75% of the data, repeating the process 10 times. We compared the performance of the pruning algorithms on two different loss matrices, which respectively set a loss of 10 and 100 for misclassifying the less frequent of the two classes. This was done to simulate real-world scenarios in which the less frequent class is the important class. Experiments were also done with the losses reversed, with similar conclusions to those shown below.

The results are displayed as graphs showing the average loss for the ten files as bars, using the scale on the left, and the average relative loss as X-symbols, using the scale on the right. The relative losses are computed as the ratio between the loss of the pruning method and that of eb-fr, our baseline method; these ratios are then averaged across the ten datasets to create summary graphs. In cases for which the losses are small, the ratio is a better indicator of performance. The averag
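As a small illustration of the relative-loss summary just described, the sketch below computes, for each method, the average ratio of its loss to the eb-fr baseline across datasets. The dictionary layout, dataset names, and numbers are placeholders, not results from the paper.

```python
# Placeholder data and layout for illustration; not the paper's results.
def average_relative_loss(losses, baseline="eb-fr"):
    """losses: {method: {dataset: average loss}}.
    Returns {method: mean over datasets of loss(method) / loss(baseline)}."""
    datasets = list(losses[baseline])
    return {
        method: sum(losses[method][d] / losses[baseline][d] for d in datasets) / len(datasets)
        for method in losses
    }

example = {
    "eb-fr":  {"dataset-a": 4.0, "dataset-b": 0.50},   # baseline method
    "ccp-lc": {"dataset-a": 3.2, "dataset-b": 0.40},
}
print(average_relative_loss(example))   # eb-fr maps to 1.0 by construction
```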