K-means clustering based SVM ensemble methods for imbalanced data problem

When one class contains significantly more or fewer samples than another, learning in a classification algorithm becomes biased toward a specific class and generalization suffers; this is called the imbalanced data problem. In this paper, we propose a novel method to address it. We first partition the data into clusters using the K-means clustering algorithm and train a Support Vector Machine (SVM) classifier on each cluster. Before training the classifier for a cluster, we balance the data in that cluster using data sampling techniques. Once all cluster classifiers are trained, we evaluate the performance of each one on validation data. The final classification of the test data is obtained by aggregating the classification results of all clusters. The aggregation uses not only the output of the classifier in each cluster, but also the credit of each classifier and the membership of the data in each cluster. We have verified that the proposed classification method performs better than existing machine learning algorithms on the imbalanced data classification problem.
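
The aggregation step can be illustrated with a minimal sketch. The abstract does not give the exact combination rule, so everything here is an assumption made for illustration: membership is taken as normalized inverse squared distance to each cluster centroid, the credit is a single number per classifier (for example, a validation score), and the final label is the sign of the membership- and credit-weighted sum of per-cluster predictions. The names `soft_memberships` and `aggregate`, and the lambda classifiers standing in for trained SVMs, are hypothetical.

```python
def soft_memberships(x, centroids):
    """Assumed membership: normalized inverse squared distance to each centroid."""
    inverse = []
    for c in centroids:
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        inverse.append(1.0 / (d2 + 1e-9))  # small constant avoids division by zero
    total = sum(inverse)
    return [v / total for v in inverse]


def aggregate(x, centroids, classifiers, credits):
    """Weighted vote over per-cluster classifiers that each return +1 or -1.

    Each vote is scaled by the classifier's credit and by the membership
    of x in that classifier's cluster (an assumed rule, not the paper's).
    """
    memberships = soft_memberships(x, centroids)
    score = sum(m * w * clf(x)
                for m, w, clf in zip(memberships, credits, classifiers))
    return 1 if score >= 0 else -1


# Hypothetical stand-ins for two trained per-cluster SVMs.
centroids = [(0.0, 0.0), (4.0, 4.0)]
classifiers = [
    lambda x: 1 if x[0] > 1.0 else -1,   # classifier for cluster 0
    lambda x: 1 if x[0] > 3.5 else -1,   # classifier for cluster 1
]
credits = [0.9, 0.6]  # assumed validation-based credit per classifier

near_cluster_1 = aggregate((3.0, 3.0), centroids, classifiers, credits)
equidistant = aggregate((2.0, 2.0), centroids, classifiers, credits)
```

In this sketch the point (3, 3) lies much closer to the second centroid, so that cluster's classifier dominates the vote even though its credit is lower. For the equidistant point (2, 2) the memberships are equal and the classifier with the higher credit decides.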
