Undersampled $K$-means approach for handling imbalanced distributed data

Abstract: $K$-means is a well-known partitional clustering technique, widely used for its low computational cost. However, the performance of the $K$-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. It often produces clusters of relatively uniform size even when the input data contain clusters of varied sizes, a behavior called the “uniform effect”. In this paper, we analyze the causes of this effect and illustrate that it is especially likely to arise in the $K$-means clustering process. As the minority class decreases in size, the “uniform effect” becomes more evident. To counter it, we revisit the well-known $K$-means algorithm and provide a general method for properly clustering imbalanced data. The proposed algorithm consists of a novel undersampling technique that intelligently removes noisy and weak instances from the majority class. We conduct experiments on twelve UCI datasets from various application domains, using five algorithms for comparison on eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm on both balanced and imbalanced data.
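One way to observe the “uniform effect” described in the abstract is to run plain $K$-means on synthetic imbalanced data and compare how uneven the cluster sizes are before and after clustering. The sketch below is only an illustration under stated assumptions: it uses standard Lloyd-style $K$-means, not the undersampled algorithm proposed in the paper, and the helper names (`kmeans`, `size_cv`), the 950/50 class split, and the Gaussian parameters are all hypothetical choices made for the example. The coefficient of variation of the cluster sizes is used here as a simple unevenness measure.

```python
import random
import statistics


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd-style K-means on a list of (x, y) tuples (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k random data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels, centers


def size_cv(sizes):
    """Coefficient of variation of cluster sizes (0 means perfectly uniform)."""
    return statistics.pstdev(sizes) / statistics.mean(sizes)


# Hypothetical imbalanced data: 950 majority points and 50 minority points.
rng = random.Random(42)
majority = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(950)]
minority = [(rng.gauss(3.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(50)]
points = majority + minority

labels, centers = kmeans(points, k=2)
found_sizes = [labels.count(j) for j in range(2)]

print("true class sizes:", [len(majority), len(minority)],
      "CV =", size_cv([len(majority), len(minority)]))
print("K-means cluster sizes:", found_sizes, "CV =", size_cv(found_sizes))
```

If the size CV of the discovered clusters is noticeably lower than that of the true classes, $K$-means has pulled the partition toward more uniform sizes. How strongly this shows up depends on the assumed class separation and spread, so the numbers from this toy setup should not be read as results from the paper.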
