Initialization of K-modes clustering using outlier detection techniques

We considered the initialization of K-modes clustering from the viewpoint of outlier detection. We proposed an initialization algorithm for K-modes clustering based on the distance-based outlier detection technique. We presented a partition entropy-based outlier detection technique and designed an initialization algorithm based on it. We proposed a new distance metric, the weighted matching distance metric. The effectiveness of our initialization algorithms was shown on several UCI data sets.

K-modes clustering has received much attention because it works well for categorical data sets. However, its performance is especially sensitive to the selection of initial cluster centers, so choosing proper initial cluster centers is a key step. In this paper, we consider the initialization of K-modes clustering from the viewpoint of outlier detection. We present two initialization algorithms for K-modes clustering: the first is based on the traditional distance-based outlier detection technique, and the second is based on a partition entropy-based outlier detection technique. By using these two techniques to calculate the degree of outlierness of each object, our algorithms can guarantee that the chosen initial cluster centers are not outliers. Moreover, during initialization we adopt a new distance metric, the weighted matching distance metric, to calculate the distance between two objects described by categorical attributes. Experimental results on several UCI data sets demonstrate the effectiveness of our initialization algorithms for K-modes clustering.
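The general idea of outlier-aware initialization can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, whose details the abstract does not give. The sketch makes several assumptions: the attribute weights are supplied by the caller (the paper's weighting scheme is not specified here), the degree of outlierness is taken to be the mean weighted matching distance to all other objects, a hypothetical `outlier_fraction` parameter excludes the most outlying objects, and the remaining centers are picked with a simple farthest-point rule.

```python
from typing import List, Sequence

Obj = Sequence[str]  # one object described by categorical attributes


def weighted_matching_distance(x: Obj, y: Obj, weights: Sequence[float]) -> float:
    """Sum the weights of the attributes on which x and y differ.

    Assumption: weights are given by the caller; the paper's own
    weighting scheme is not reproduced here.
    """
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def outlierness(data: List[Obj], weights: Sequence[float]) -> List[float]:
    """Assumed outlier score: mean distance from each object to all others."""
    n = len(data)
    scores = []
    for i, x in enumerate(data):
        total = sum(
            weighted_matching_distance(x, y, weights)
            for j, y in enumerate(data)
            if j != i
        )
        scores.append(total / (n - 1))
    return scores


def initial_centers(
    data: List[Obj],
    k: int,
    weights: Sequence[float],
    outlier_fraction: float = 0.2,  # hypothetical threshold, not from the paper
) -> List[Obj]:
    """Pick k initial centers that are mutually distant but not outliers."""
    scores = outlierness(data, weights)
    order = sorted(range(len(data)), key=lambda i: scores[i])
    n_drop = int(len(data) * outlier_fraction)
    candidates = order[: len(data) - n_drop]  # drop the most outlying objects
    if k > len(candidates):
        raise ValueError("k exceeds the number of non-outlier candidates")

    centers = [candidates[0]]  # start from the least outlying object
    while len(centers) < k:
        # Farthest-point rule: maximize the distance to the nearest chosen center.
        best = max(
            (i for i in candidates if i not in centers),
            key=lambda i: min(
                weighted_matching_distance(data[i], data[c], weights)
                for c in centers
            ),
        )
        centers.append(best)
    return [data[i] for i in centers]


if __name__ == "__main__":
    toy = [
        ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
        ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"),
        ("c", "v", "t"),  # differs from every other object on all attributes
    ]
    print(initial_centers(toy, k=2, weights=[1.0, 1.0, 1.0]))
```

Excluding high-scoring objects before the farthest-point step matters because a pure farthest-point rule tends to select outliers, which by construction lie far from everything else.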
