On Strategies to Fix Degenerate k-means Solutions

Abstract: k-means is a benchmark algorithm used in cluster analysis. It belongs to the large category of heuristics based on location-allocation steps, which alternately locate cluster centers and allocate data points to them until no further improvement is possible. Such heuristics are known to suffer from a phenomenon called degeneracy, in which some of the clusters are empty. In this paper, we propose and compare a series of strategies to circumvent degenerate solutions during a k-means execution. Our computational experiments show that these strategies are effective, leading to better clustering solutions in the vast majority of the cases in which degeneracy appears in k-means. Moreover, we compare the use of our fixing strategies within k-means against the use of two initialization methods found in the literature. These results demonstrate how useful the proposed strategies can be, especially inside memory-based clustering algorithms.
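
To make the notion of degeneracy concrete, the following is a minimal sketch of a location-allocation loop with one possible repair rule. It is an illustration only: the abstract does not specify the strategies studied in the paper, so the rule used here (move the point farthest from its current center into the empty cluster) and the function names `sq_dist` and `kmeans_with_fix` are assumptions, not the authors' method. The sketch also assumes there are at least as many points as clusters.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_with_fix(points, centers, max_iter=100):
    """Lloyd-style k-means that repairs empty clusters (hypothetical rule).

    Assumes len(points) >= len(centers).
    """
    k = len(centers)
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Allocation step: assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]

        # Degeneracy check: a cluster with no points is empty.
        counts = [new_labels.count(j) for j in range(k)]
        for j in range(k):
            if counts[j] == 0:
                # Illustrative repair: take the point farthest from its own
                # center, restricted to clusters that can spare a point.
                candidates = [
                    i for i in range(len(points)) if counts[new_labels[i]] > 1
                ]
                i = max(
                    candidates,
                    key=lambda i: sq_dist(points[i], centers[new_labels[i]]),
                )
                counts[new_labels[i]] -= 1
                new_labels[i] = j
                counts[j] = 1
                centers[j] = tuple(points[i])

        # Stop when the allocation no longer changes.
        if new_labels == labels:
            break
        labels = new_labels

        # Location step: move each center to the centroid of its cluster.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == j]
            centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels
```

Starting this sketch from a center placed far from all data, for example `(100, 100)` alongside centers at `(0, 0)` and `(10, 0)`, leaves the third cluster empty after the first allocation step, and the repair rule then re-seeds it from an outlying point instead of returning a solution with fewer than k clusters.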
