The class imbalance problem is pervasive in machine learning. To accurately classify the minority class, current methods rely either on sampling schemes that close the gap between classes or on error costs that bias the algorithm toward the minority class. Because the sampling schemes and costs must be specified, these methods are highly dependent on the class distributions present in the training set. This makes them difficult to apply in settings where the level of imbalance changes, such as online streaming data. They also often cannot handle multi-class problems. We present a novel single-class algorithm called Class Conditional Nearest Neighbor Distribution (CCNND), which mitigates the effects of class imbalance through local geometric structure in the data. Our algorithm can be applied seamlessly to problems with any level of imbalance or number of classes, and new examples are simply added to the training set. We show that it performs as well as or better than top sampling and cost-weighting methods on four imbalanced datasets from the UCI Machine Learning Repository, and then apply it to streaming data from the oil and gas industry alongside a modified nearest neighbor algorithm. Our algorithm's competitive performance relative to the state of the art, coupled with its extremely simple implementation and automatic adjustment for minority classes, demonstrates that it merits further study.