The Foundations of Cost-Sensitive Learning

This paper revisits the problem of optimal learning and decision-making when different misclassification errors incur different penalties. We characterize precisely but intuitively when a cost matrix is reasonable, and we show how to avoid the mistake of defining a cost matrix that is economically incoherent. For the two-class case, we prove a theorem that shows how to change the proportion of negative examples in a training set in order to make optimal cost-sensitive classification decisions using a classifier learned by a standard non-cost-sensitive learning method. However, we then argue that changing the balance of negative and positive training examples has little effect on the classifiers produced by standard Bayesian and decision tree learning methods. Accordingly, the recommended way of applying one of these methods in a domain with differing misclassification costs is to learn a classifier from the training set as given, and then to compute optimal decisions explicitly using the probability estimates given by the classifier.

1 Making decisions based on a cost matrix

Given a specification of costs for correct and incorrect predictions, an example should be predicted to have the class that leads to the lowest expected cost, where the expectation is computed using the conditional probability of each class given the example. Mathematically, let the $(i, j)$ entry in a cost matrix $C$ be the cost of predicting class $i$ when the true class is $j$. If $i = j$ then the prediction is correct, while if $i \neq j$ the prediction is incorrect. The optimal prediction for an example $x$ is the class $i$ that minimizes

$$L(x, i) = \sum_j P(j \mid x) \, C(i, j). \qquad (1)$$

Costs are not necessarily monetary. A cost can also be a waste of time, or the severity of an illness, for example. For each $i$, $L(x, i)$ is a sum over the alternative possibilities for the true class of $x$. In this framework, the role of a learning algorithm is to produce a classifier that for any example $x$ can estimate the probability $P(j \mid x)$ of each class $j$ being the true class of $x$. For an example $x$, making the prediction $i$ means acting as if $i$ is the true class of $x$. The essence of cost-sensitive decision-making is that it can be optimal to act as if one class is true even when some other class is more probable. For example, it can be rational not to approve a large credit card transaction even if the transaction is most likely legitimate.
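To make the decision rule of equation (1) concrete, here is a minimal Python sketch of minimum-expected-cost prediction. The cost matrix and probabilities below are hypothetical numbers chosen to mirror the credit card example; they are not from the paper.

```python
import numpy as np

def optimal_prediction(probs, C):
    """Return the class i minimizing L(x, i) = sum_j P(j|x) * C(i, j).

    probs: length-k vector of conditional class probabilities P(j|x).
    C:     k-by-k cost matrix, C[i, j] = cost of predicting i when truth is j.
    """
    expected_costs = C @ probs  # entry i is L(x, i)
    return int(np.argmin(expected_costs)), expected_costs

# Hypothetical transaction: class 0 = legitimate, class 1 = fraudulent.
# Approving a fraudulent transaction costs 10000; refusing a legitimate
# one costs 50 in customer goodwill; correct decisions cost nothing.
C = np.array([[0.0, 10000.0],   # predict legitimate (approve)
              [50.0, 0.0]])     # predict fraudulent (refuse)
probs = np.array([0.95, 0.05])  # the transaction is most likely legitimate

pred, costs = optimal_prediction(probs, C)
print(pred, costs)  # pred = 1: L(x, 0) = 500.0, L(x, 1) = 47.5
```

Even though the legitimate class is far more probable, refusing the transaction minimizes expected cost, which is exactly the point made above.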
1.1 Cost matrix properties

A cost matrix $C$ always has the following structure when there are only two classes:

                     actual negative       actual positive
predict negative     $C(0,0) = c_{00}$     $C(0,1) = c_{01}$
predict positive     $C(1,0) = c_{10}$     $C(1,1) = c_{11}$

Recent papers have followed the convention that cost matrix rows correspond to alternative predicted classes, while columns correspond to actual classes, i.e. row/column = $i$/$j$ = predicted/actual. In our notation, the cost of a false positive is $c_{10}$ while the cost of a false negative is $c_{01}$.

Conceptually, the cost of labeling an example incorrectly should always be greater than the cost of labeling it correctly. Mathematically, it should always be the case that $c_{10} > c_{00}$ and $c_{01} > c_{11}$. We call these conditions the "reasonableness" conditions.

Suppose that the first reasonableness condition is violated, so $c_{00} \geq c_{10}$ but still $c_{01} > c_{11}$. In this case the optimal policy is to label all examples positive. Similarly, if $c_{10} > c_{00}$ but $c_{11} \geq c_{01}$, it is optimal to label all examples negative. This reasoning generalizes to any number of classes: we say that row $m$ dominates row $n$ in a cost matrix $C$ if for all $j$, $C(m, j) \geq C(n, j)$. In this case the cost of predicting $n$ is no greater than the cost of predicting $m$, regardless of what the true class is, so it is optimal never to predict $m$. As a special case, the optimal prediction is always $n$ if row $n$ is dominated by all other rows in a cost matrix. The two reasonableness conditions for a two-class cost matrix imply that neither row in the matrix dominates the other.
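These conditions are easy to check mechanically. The following Python sketch (the helper names are my own, for illustration) tests the two-class reasonableness conditions and finds rows of a general cost matrix that dominate some other row and therefore should never be predicted.

```python
import numpy as np

def is_reasonable(C):
    """Two-class reasonableness: each error costs more than the
    corresponding correct prediction, i.e. c10 > c00 and c01 > c11."""
    return C[1, 0] > C[0, 0] and C[0, 1] > C[1, 1]

def dominating_rows(C):
    """Rows m that dominate some other row n, i.e. C[m, j] >= C[n, j]
    for every column j. Predicting such an m is never strictly better
    than predicting n, so m can be discarded as a prediction."""
    k = C.shape[0]
    return {m for m in range(k) for n in range(k)
            if m != n and np.all(C[m] >= C[n])}

C = np.array([[0.0, 10.0],
              [1.0,  0.0]])
print(is_reasonable(C), dominating_rows(C))          # True set()

C_bad = np.array([[2.0, 10.0],                       # violates c10 > c00
                  [1.0,  0.0]])
print(is_reasonable(C_bad), dominating_rows(C_bad))  # False {0}
```

For C_bad, row 0 (predict negative) dominates row 1, so labeling every example positive is optimal, just as the argument above predicts.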
Given a cost matrix, the decisions that are optimal are unchanged if each entry in the matrix is multiplied by a positive constant. This scaling corresponds to changing the unit of account for costs. Similarly, the decisions that are optimal are unchanged if a constant is added to each entry in the matrix. This shifting corresponds to changing the baseline away from which costs are measured. By scaling and shifting entries, any two-class cost matrix that satisfies the reasonableness conditions can be transformed into a simpler matrix that always leads to the same decisions:

                     actual negative    actual positive
predict negative     $0$                $c_{01}' = (c_{01} - c_{00}) / (c_{10} - c_{00})$
predict positive     $1$                $c_{11}' = (c_{11} - c_{00}) / (c_{10} - c_{00})$
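A minimal sketch of this normalization, assuming the input matrix satisfies the reasonableness conditions (the function name and example values are illustrative):

```python
import numpy as np

def normalize_cost_matrix(C):
    """Shift and scale a reasonable 2x2 cost matrix into the equivalent
    simpler form whose first column is (0, 1). Shifting subtracts the
    baseline c00; scaling divides by c10 - c00, which is positive by
    the reasonableness conditions. Optimal decisions are unchanged."""
    shifted = C - C[0, 0]           # baseline shift: c00 becomes 0
    return shifted / shifted[1, 0]  # unit-of-account scaling: c10 - c00 becomes 1

C = np.array([[-5.0, 95.0],
              [15.0,  5.0]])
print(normalize_cost_matrix(C))
# [[0.   5. ]
#  [1.   0.5]]
```

Because shifting adds the same constant to every expected cost $L(x, i)$ (the probabilities $P(j \mid x)$ sum to one) and scaling multiplies every $L(x, i)$ by the same positive factor, the minimizing class in equation (1) is unaffected.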