REPMAC: A New Hybrid Approach to Highly Imbalanced Classification Problems

The class imbalance problem, in which one class has far fewer samples than the others, is of great importance in machine learning because it arises in many critical applications. In this work we introduce the recursive partitioning of the majority class (REPMAC) algorithm, a new hybrid method for imbalanced problems. Using a clustering method, REPMAC recursively splits the majority class into several subsets, creating a decision tree, until the resulting sub-problems are balanced or easy to solve. At that point, a classifier is fitted to each sub-problem. We evaluate the new method on 7 datasets from the UCI repository and find that REPMAC is more efficient than other methods usually applied to imbalanced datasets.
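The recursive scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the clustering step is a plain 2-means with a deterministic farthest-point start, every sub-problem pairs one majority cluster with the full minority class, the stopping rule checks only a balance ratio (`max_ratio`) and a depth limit (the "easy to solve" test is omitted), the leaf classifier is a nearest-centroid rule standing in for any real classifier, and new points are routed to the nearest cluster centre. All function and parameter names are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def two_means(points, iters=20):
    """Split points into two groups with a simple 2-means loop.

    Assumption: centres start at the first point and the point farthest
    from it, which keeps the sketch deterministic.
    """
    c0 = points[0]
    c1 = max(points, key=lambda p: dist2(p, c0))
    g0, g1 = list(points), []
    for _ in range(iters):
        g0 = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        g1 = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        if not g0 or not g1:
            break  # degenerate split; the caller turns this node into a leaf
        c0, c1 = mean(g0), mean(g1)
    return (c0, g0), (c1, g1)


def make_leaf(majority, minority):
    """Leaf sub-problem: nearest-centroid rule (stand-in for a real classifier)."""
    return {"leaf": True, "maj": mean(majority), "min": mean(minority)}


def build(majority, minority, max_ratio=2.0, depth=0, max_depth=8):
    """Recursively partition the majority class until a sub-problem is balanced.

    Assumes both classes are non-empty. Each sub-problem keeps the whole
    minority class (an assumption of this sketch).
    """
    balanced = len(majority) <= max_ratio * len(minority)
    if balanced or len(majority) < 2 or depth >= max_depth:
        return make_leaf(majority, minority)
    (c0, g0), (c1, g1) = two_means(majority)
    if not g0 or not g1:
        return make_leaf(majority, minority)
    return {
        "leaf": False,
        "centers": (c0, c1),
        "children": (
            build(g0, minority, max_ratio, depth + 1, max_depth),
            build(g1, minority, max_ratio, depth + 1, max_depth),
        ),
    }


def predict(node, x):
    """Route x to the nearest cluster centre at each node, then classify.

    Returns 1 for the minority class and 0 for the majority class.
    """
    while not node["leaf"]:
        c0, c1 = node["centers"]
        node = node["children"][0 if dist2(x, c0) <= dist2(x, c1) else 1]
    return 1 if dist2(x, node["min"]) < dist2(x, node["maj"]) else 0
```

To use the sketch, call `build(majority_points, minority_points)` with two lists of equal-length numeric tuples, then `predict(tree, x)` for a new point `x`. In practice the leaf rule would be replaced by whichever classifier suits each sub-problem.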
