Imbalanced K-Means: An algorithm to cluster imbalanced-distributed data

K-means is a partitional clustering technique that is well known and widely used for its low computational cost. However, the performance of the k-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. It often produces clusters of relatively uniform size even when the true clusters in the input data vary in size, which is called the “uniform effect.” In this paper, we analyze the causes of this effect and illustrate that it probably occurs more in the k-means clustering process. As the minority class decreases in size, the “uniform effect” becomes more evident. To prevent the “uniform effect”, we revisit the well-known K-means algorithm and provide a general method to properly cluster imbalanced data. We present Imbalanced K-Means (IKM), a multi-purpose partitional clustering procedure that minimizes the clustering sum of squared error criterion, while imposing a hard sequentiality constraint in the clustering step. The proposed algorithm consists of a novel oversampling technique implemented by removing noisy and weak instances from both majority and minority classes and then oversampling only novel minority instances. We conduct experiments on twelve UCI datasets from various application domains, comparing against five algorithms on eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm in clustering balanced and imbalanced
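The “uniform effect” described above can be illustrated with a minimal sketch. The code below is not the IKM procedure; it runs plain Lloyd-style k-means on synthetic data, and every name and parameter in it (`lloyd_kmeans`, the class sizes, the Gaussian means and spreads, the seed) is an assumption chosen for illustration only. The centroids are even initialized at the true class means, which is the most favorable start for recovering the real sizes.

```python
import numpy as np


def lloyd_kmeans(X, centroids, n_iter=100):
    """Plain Lloyd's k-means (illustrative helper, not the paper's IKM)."""
    centroids = centroids.astype(float).copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Synthetic imbalanced data (assumed values): a large, spread-out majority
# class and a small, compact minority class.
rng = np.random.default_rng(0)
n_majority, n_minority = 2000, 50
majority = rng.normal(loc=[0.0, 0.0], scale=2.0, size=(n_majority, 2))
minority = rng.normal(loc=[5.0, 0.0], scale=0.5, size=(n_minority, 2))
X = np.vstack([majority, minority])

# Start from the true class means, then let k-means iterate.
init = np.array([[0.0, 0.0], [5.0, 0.0]])
labels, centroids = lloyd_kmeans(X, init)
cluster_sizes = np.bincount(labels, minlength=2)

print("true class sizes:     ", (n_majority, n_minority))
print("k-means cluster sizes:", tuple(int(s) for s in cluster_sizes))
```

In this setup the smaller k-means cluster ends up holding many majority points in addition to the minority class, so the two cluster sizes are closer to each other than the true class sizes are. Shrinking `n_minority` further should make the mismatch more pronounced, in line with the observation in the abstract.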
