Strategies for learning in class imbalance problems

A set of examples or training set (TS) is said to be imbalanced if one of the classes is represented by a very small number of cases compared to the other classes. Following common practice [1,2], we consider only two-class problems; the examples are therefore either positive or negative (that is, from the minority class or the majority class, respectively). High imbalance occurs in applications where the classifier must detect a rare but important case, such as fraudulent telephone calls, oil spills in satellite images, failures in a manufacturing process, or rare medical diagnoses. It has been observed that class imbalance may cause a significant deterioration in the performance attainable by standard supervised methods.

Most of the attempts at dealing with this problem can be grouped into three categories [2]. The first is to assign distinct costs to the classification errors on each class. The second is to resample the original TS, by over-sampling the minority class, under-sampling the majority class, or both, until the classes are approximately equally represented. The third consists of internally biasing the discrimination process so as to compensate for the class imbalance.

As pointed out by many authors, the performance of a classifier in applications with class imbalance must not be expressed in terms of average accuracy. For instance, consider a domain where only 2% of the examples are positive. In such a situation, labeling all new samples as negative would give an accuracy of 98% while failing on every positive case. Consequently, in environments with imbalanced classes, performance should be measured with criteria that account for each class separately.
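To make this concrete, the following minimal sketch contrasts plain accuracy with the geometric mean of the per-class accuracies, one widely used alternative criterion for imbalanced domains (our choice of measure for illustration, not necessarily the one adopted in this work), on the 2%-positive scenario just described. All function names are ours.

    import math

    def accuracy(y_true, y_pred):
        """Fraction of all examples that are labeled correctly."""
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    def geometric_mean(y_true, y_pred):
        """Geometric mean of the accuracies on the positive and negative class."""
        pos = [p for t, p in zip(y_true, y_pred) if t == 1]
        neg = [p for t, p in zip(y_true, y_pred) if t == 0]
        acc_pos = sum(p == 1 for p in pos) / len(pos)
        acc_neg = sum(p == 0 for p in neg) / len(neg)
        return math.sqrt(acc_pos * acc_neg)

    # Domain from the text: 2 positive examples out of 100.
    y_true = [1] * 2 + [0] * 98
    y_pred = [0] * 100  # trivial classifier: label every sample negative

    print(accuracy(y_true, y_pred))        # 0.98 -- looks excellent
    print(geometric_mean(y_true, y_pred))  # 0.0  -- exposes total failure on positives

Because the geometric mean is zero whenever either class is classified entirely wrong, the trivial "always negative" classifier receives the worst possible score despite its 98% accuracy.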
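The two resampling strategies named in the second category above can likewise be sketched. The snippet below (a schematic illustration under our own, hypothetical naming; practical studies use more refined variants) balances a toy 98:2 TS by random under-sampling of the majority class and by random over-sampling of the minority class.

    import random

    def undersample_majority(majority, minority, seed=0):
        """Randomly discard majority examples until both classes have equal size."""
        rng = random.Random(seed)
        return rng.sample(majority, len(minority)), minority

    def oversample_minority(majority, minority, seed=0):
        """Randomly duplicate minority examples until both classes have equal size."""
        rng = random.Random(seed)
        extra = rng.choices(minority, k=len(majority) - len(minority))
        return majority, minority + extra

    # Toy TS with the 98:2 ratio used in the accuracy example above.
    negatives = [("neg", i) for i in range(98)]
    positives = [("pos", i) for i in range(2)]

    maj, mino = undersample_majority(negatives, positives)
    print(len(maj), len(mino))  # 2 2

    maj, mino = oversample_minority(negatives, positives)
    print(len(maj), len(mino))  # 98 98

Over-sampling retains every majority example at the cost of duplicating minority cases, whereas under-sampling discards potentially useful majority data; which trade-off is preferable is an empirical question.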