Paper information - K-means properties on six clustering benchmark datasets

K-means properties on six clustering benchmark datasets

This paper has two contributions. First, we introduce a clustering basic benchmark. Second, we study the performance of k-means using this benchmark. Specifically, we measure how the performance depends on four factors: (1) overlap of clusters, (2) number of clusters, (3) dimensionality, and (4) unbalance of cluster sizes. The results show that overlap is critical, and that k-means starts to work effectively when the overlap reaches the 4% level.

Pasi Fränti | Sami Sieranoja

[1] Teofilo F. Gonzalez,et al. Clustering to Minimize the Maximum Intercluster Distance , 1985, Theor. Comput. Sci..

[2] Santosh S. Vempala,et al. A discriminative framework for clustering via similarity functions , 2008, STOC.

[3] Alexandros Nanopoulos,et al. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data , 2010, J. Mach. Learn. Res..

[4] Sergei Vassilvitskii,et al. k-means++: the advantages of careful seeding , 2007, SODA '07.

[5] Fabien Moutarde,et al. U*F clustering: a new performant "cluster-mining" method based on segmentation of Self-Organizing Maps , 2005 .

[6] Paul S. Bradley,et al. Refining Initial Points for K-Means Clustering , 1998, ICML.

[7] Hava T. Siegelmann,et al. Support Vector Clustering , 2002, J. Mach. Learn. Res..

[8] Hui Xiong,et al. K-means clustering versus validation measures: a data distribution perspective , 2006, KDD '06.

[9] Pasi Fränti. Efficiency of random swap clustering , 2018, Journal of Big Data.

[10] Vipin Kumar,et al. The Challenges of Clustering High Dimensional Data , 2004 .

[11] Pasi Fränti. Genetic algorithm with deterministic crossover for vector quantization , 2000 .

[12] Tomi Kinnunen,et al. Comparison of clustering methods: A case study of text-independent speaker modeling , 2011, Pattern Recognit. Lett..

[13] Ranjan Maitra,et al. Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms , 2010 .

[14] Dunja Mladenic,et al. The Role of Hubness in Clustering High-Dimensional Data , 2011, IEEE Transactions on Knowledge and Data Engineering.

[15] Gonzalo Navarro,et al. A Probabilistic Spell for the Curse of Dimensionality , 2001, ALENEX.

[16] Tian Zhang,et al. BIRCH: A New Data Clustering Algorithm and Its Applications , 1997, Data Mining and Knowledge Discovery.

[17] Pasi Fränti,et al. Centroid index: Cluster level similarity measure , 2014, Pattern Recognit..

[18] Pasi Fränti,et al. Randomised Local Search Algorithm for the Clustering Problem , 2000, Pattern Analysis & Applications.

[19] Charu C. Aggarwal,et al. On the Surprising Behavior of Distance Metrics in High Dimensional Spaces , 2001, ICDT.

[20] S. Dasgupta. The hardness of k-means clustering , 2008 .

[21] J. A. Hartigan,et al. A k-means clustering algorithm , 1979 .

[22] Jiye Liang,et al. The $K$-Means-Type Algorithms Versus Imbalanced Data Distributions , 2012, IEEE Transactions on Fuzzy Systems.

[23] Pasi Fränti,et al. XNN Graph , 2016, S+SSPR.

[24] Pierre L'Ecuyer,et al. Efficient and portable combined Tausworthe random number generators , 1990, TOMC.

[25] Isabelle Guyon,et al. Clustering: Science or Art? , 2009, ICML Unsupervised and Transfer Learning.

[26] Michael J. Brusco,et al. Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques , 2007, J. Classif..

[27] Murat Erisoglu,et al. A new algorithm for initial cluster centers in k-means algorithm , 2011, Pattern Recognit. Lett..

[28] Pasi Fränti,et al. A Dynamic local search algorithm for the clustering problem , 2002 .

[29] M. Narasimha Murty,et al. Genetic K-means algorithm , 1999, IEEE Trans. Syst. Man Cybern. Part B.

[30] Ravishankar Krishnaswamy,et al. The Hardness of Approximation of Euclidean k-Means , 2015, SoCG.

[31] Douglas Steinley,et al. Local optima in K-means clustering: what you don't know may hurt you. , 2003, Psychological methods.

[32] G H Ball,et al. A clustering technique for summarizing multivariate data. , 1967, Behavioral science.

[33] Pasi Fränti,et al. Iterative shrinking method for clustering problems , 2006, Pattern Recognit..

[34] Lawrence Hubert,et al. The variance of the adjusted Rand index. , 2016, Psychological methods.

[35] Andrew W. Moore,et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters , 2000, ICML.

[36] Pasi Fränti,et al. Random Projection for k-means Clustering , 2018, ICAISC.

[37] Shuai Cheng Li,et al. A PTAS For The k-Consensus Structures Problem Under Squared Euclidean Distance , 2008, Algorithms.

[38] Deniz Yuret,et al. Locally Scaled Density Based Clustering , 2007, ICANNGA.

[39] Daniel A. Keim,et al. Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering , 1999, VLDB.

[40] Pasi Fränti,et al. Set Matching Measures for External Cluster Validity , 2016, IEEE Transactions on Knowledge and Data Engineering.

[41] S. P. Lloyd,et al. Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[42] Anil K. Jain. Data clustering: 50 years beyond K-means , 2008, Pattern Recognit. Lett..

[43] Meena Mahajan,et al. The Planar k-means Problem is NP-hard I , 2009 .

[44] J. E. Hirsch,et al. An index to quantify an individual's scientific research output , 2005, Proc. Natl. Acad. Sci. USA.

[45] Pasi Fränti,et al. WB-index: A sum-of-squares based index for cluster validity , 2014, Data Knowl. Eng..

[46] Charles D. Mallah,et al. Plant Leaf Classification Using Probabilistic Integration of Shape, Texture and Margin Features , 2013 .

[47] J. MacQueen. Some methods for classification and analysis of multivariate observations , 1967 .

[48] Pedro Larrañaga,et al. An empirical comparison of four initialization methods for the K-Means algorithm , 1999, Pattern Recognit. Lett..

[49] Jonathan Goldstein,et al. When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[50] Pasi Fränti,et al. Fast Agglomerative Clustering Using a k-Nearest Neighbor Graph , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51] E. Forgy. Cluster analysis of multivariate data : efficiency versus interpretability of classifications , 1965 .

[52] Ling Huang,et al. Fast approximate spectral clustering , 2009, KDD.

[53] Li Zhang,et al. Feature clustering based support vector machine recursive feature elimination for gene selection , 2018, Applied Intelligence.

[54] Richard O. Duda,et al. Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[55] Dimitrios Gunopulos,et al. Locally adaptive metrics for clustering high dimensional data , 2007, Data Mining and Knowledge Discovery.

[56] Pasi Fränti,et al. Dynamic Local Search for Clustering with Unknown Number of Clusters , 2002, ICPR.

[57] Sylvain Chartier,et al. The k-means clustering technique: General considerations and implementation in Mathematica , 2013 .

[58] Moh'd Belal Al Zoubi,et al. An Efficient Approach for Computing Silhouette Coefficients , 2008 .

[59] Shanlin Yang,et al. Exploring the uniform effect of FCM clustering: A data distribution perspective , 2016, Knowl. Based Syst..

[60] David M. Mount,et al. A local search approximation algorithm for k-means clustering , 2002, SCG '02.

[61] Eamonn J. Keogh. Nearest Neighbor , 2010, Encyclopedia of Machine Learning.

[62] Simina Brânzei,et al. Weighted Clustering , 2011, AAAI.

[63] E. M. Wright,et al. Adaptive Control Processes: A Guided Tour , 1961, The Mathematical Gazette.