A time-efficient pattern reduction algorithm for k-means clustering

This paper presents an efficient algorithm, called pattern reduction (PR), for reducing the computation time of k-means and k-means-based clustering algorithms. At each iteration, the proposed algorithm compresses and removes patterns that are unlikely to change their membership thereafter. The algorithm is simple and easy to implement, and it can also be applied to many other iterative clustering algorithms, such as kernel-based and population-based clustering algorithms. Our experiments, which range from 2 to 1000 dimensions and from 150 to 10,000,000 patterns, indicate that with a small loss of quality, the proposed algorithm can significantly reduce the computation time of all state-of-the-art clustering algorithms evaluated in this paper, especially for large and high-dimensional data sets.
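The idea of compressing and removing patterns can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how "unlikely to change membership" is detected, so the sketch assumes a pattern is static once its label has stayed the same for `stable_after` consecutive iterations. The function name `kmeans_pr` and the parameters `stable_after` and `init` are hypothetical. Removed patterns are compressed into a per-cluster running sum and count, so they still contribute to the centroid update but are no longer compared against the centroids.

```python
import numpy as np

def kmeans_pr(X, k, iters=20, stable_after=2, init=None, seed=0):
    """k-means with a simple pattern-reduction step (illustrative sketch).

    Assumption (not from the paper): a pattern is treated as static once
    its label is unchanged for `stable_after` consecutive iterations.
    """
    X = np.asarray(X, dtype=float)
    n, dim = X.shape
    if init is None:
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(n, size=k, replace=False)].copy()
    else:
        centers = np.array(init, dtype=float)

    labels = np.full(n, -1)
    streak = np.zeros(n, dtype=int)      # iterations with an unchanged label
    active = np.ones(n, dtype=bool)      # patterns still being reassigned
    frozen_sum = np.zeros((k, dim))      # compressed (removed) patterns:
    frozen_cnt = np.zeros(k)             # per-cluster sum and count
    evaluated = 0                        # number of pattern-to-centers comparisons

    for _ in range(iters):
        idx = np.flatnonzero(active)
        if idx.size == 0:
            break
        # Assignment step on the remaining (active) patterns only.
        d = ((X[idx, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new = d.argmin(axis=1)
        evaluated += idx.size
        streak[idx] = np.where(new == labels[idx], streak[idx] + 1, 0)
        labels[idx] = new

        # Compress and remove patterns that look static.
        done = idx[streak[idx] >= stable_after]
        np.add.at(frozen_sum, labels[done], X[done])
        np.add.at(frozen_cnt, labels[done], 1)
        active[done] = False

        # Update step: active patterns plus the compressed summaries.
        idx = np.flatnonzero(active)
        sums, cnts = frozen_sum.copy(), frozen_cnt.copy()
        np.add.at(sums, labels[idx], X[idx])
        np.add.at(cnts, labels[idx], 1)
        nonempty = cnts > 0
        centers[nonempty] = sums[nonempty] / cnts[nonempty, None]

    return labels, centers, evaluated
```

In this sketch the saving comes from the shrinking active set: once most patterns are frozen, each iteration computes distances for only the few that remain. The small loss of quality mentioned in the abstract corresponds to patterns that are frozen too early and would otherwise have moved to another cluster.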
