K-means properties on six clustering benchmark datasets

This paper has two contributions. First, we introduce a clustering basic benchmark. Second, we study the performance of k-means using this benchmark. Specifically, we measure how the performance depends on four factors: (1) overlap of clusters, (2) number of clusters, (3) dimensionality, and (4) unbalance of cluster sizes. The results show that overlap is critical, and that k-means starts to work effectively when the overlap reaches the 4% level.
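
The kind of experiment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark or the evaluation protocol of the paper: the 3x3 grid of cluster centers, the `spread` parameter (used here as a rough stand-in for cluster overlap), and the helper names `make_clusters`, `kmeans`, `orphan_count`, and `success_rate` are all hypothetical. The success check is a simplified, one-directional test of whether every true cluster received at least one centroid.

```python
import random

def make_clusters(centers, n_per_cluster, spread, rng):
    """Draw n_per_cluster 2-D Gaussian points around each center."""
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in centers for _ in range(n_per_cluster)]

def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, rng, max_iter=100):
    """Lloyd's k-means; initial centroids are k random data points."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: put each point in the group of its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[j].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its old centroid).
        new = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids

def orphan_count(centroids, true_centers):
    """Count true centers that are not the nearest true center of any centroid."""
    hit = {min(range(len(true_centers)), key=lambda i: sq_dist(c, true_centers[i]))
           for c in centroids}
    return len(true_centers) - len(hit)

def success_rate(spread, runs=20, seed=0):
    """Fraction of k-means runs in which no true cluster is left without a centroid."""
    rng = random.Random(seed)
    centers = [(x, y) for x in (0, 10, 20) for y in (0, 10, 20)]  # assumed layout
    points = make_clusters(centers, 50, spread, rng)
    ok = sum(orphan_count(kmeans(points, len(centers), rng), centers) == 0
             for _ in range(runs))
    return ok / runs

if __name__ == "__main__":
    # Larger spread means more overlap between neighboring clusters.
    for spread in (0.5, 2.0, 4.0):
        print(spread, success_rate(spread))
```

Repeating the same loop while varying the number of clusters, the dimensionality, or the cluster sizes instead of `spread` would probe the other three factors in the same way.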
