Evaluation of a new hybrid algorithm for highly imbalanced classification problems

In many classification problems, particularly in critical real-world applications, one of the classes has far fewer samples than the others, a situation usually known as the class imbalance problem. In this work we discuss and evaluate the use of the REPMAC algorithm to solve imbalanced problems. Using a clustering method, REPMAC recursively splits the majority class into several subsets, creating a decision tree, until the resulting sub-problems are balanced or easy to solve. We couple REPMAC with two diverse clustering methods and three different classifiers, and evaluate the new method on several benchmark datasets spanning a wide range of numbers of features, numbers of samples, and degrees of imbalance. We also apply our method to a real-world problem, the identification of weed seeds. We find that the good performance of REPMAC is almost independent of the classifier or the clustering method coupled to it, which suggests that its success is mostly related to the use of an appropriate strategy to cope with imbalanced problems.
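
The recursive splitting described in the abstract can be illustrated with a minimal sketch. It is one plausible reading of the description, not the authors' implementation. The following are assumptions made only for illustration: a deterministic 2-means loop stands in for the clustering method, a nearest-centroid rule stands in for the leaf classifier, queries are routed to the nearest cluster center, and splitting stops on a hypothetical `max_ratio` balance threshold or a `max_depth` limit (the "easy to solve" criterion is not modeled). All function and parameter names are hypothetical.

```python
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    return [mean(col) for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iters=20):
    """Stand-in clustering step (assumption): deterministic 2-means."""
    c = centroid(points)
    first = max(points, key=lambda p: sq_dist(p, c))       # farthest from the mean
    second = max(points, key=lambda p: sq_dist(p, first))  # farthest from `first`
    centers = [first, second]
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        centers = [centroid(g) for g in groups]
    return groups, centers


def build_tree(majority, minority, max_ratio=5.0, depth=0, max_depth=5):
    """Recursively split the majority class until each sub-problem is balanced."""
    def leaf():
        # Stand-in leaf classifier (assumption): one centroid per class.
        return {"leaf": True, "maj_c": centroid(majority), "min_c": centroid(minority)}

    balanced = len(majority) <= max_ratio * len(minority)
    if balanced or depth >= max_depth or len(majority) < 2:
        return leaf()
    groups, centers = two_means(majority)
    if not groups[0] or not groups[1]:
        return leaf()
    # Each majority subset is paired with the full minority class.
    children = [build_tree(g, minority, max_ratio, depth + 1, max_depth) for g in groups]
    return {"leaf": False, "centers": centers, "children": children}


def predict(node, x):
    """Route x to a leaf by nearest cluster center; return 1 (minority) or 0 (majority)."""
    while not node["leaf"]:
        d0, d1 = (sq_dist(x, c) for c in node["centers"])
        node = node["children"][0 if d0 <= d1 else 1]
    return 1 if sq_dist(x, node["min_c"]) < sq_dist(x, node["maj_c"]) else 0


# Toy data: two well-separated majority groups (40 points) and 4 minority points.
majority = [[bx + 0.1 * i, 0.1 * j] for bx in (0.0, 10.0) for i in range(5) for j in range(4)]
minority = [[5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [5.2, 5.2]]
tree = build_tree(majority, minority, max_ratio=5.0)
print([predict(tree, x) for x in ([0.2, 0.1], [10.1, 0.2], [5.0, 5.0])])
```

On this toy data the 10:1 problem is split once into two 5:1 sub-problems, each of which is then handled by its own leaf classifier. In practice the clustering step and the leaf classifier would be replaced by whichever methods are being evaluated.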
