Robust partitional clustering by outlier and density insensitive seeding

k-means, the leading partitional clustering technique, is among the most computationally efficient clustering methods. However, it converges only to a locally optimal solution that depends strongly on its initial seeds. Poor initial seeds can also split or merge natural clusters even when the clusters are well separated. In this paper we propose ROBIN, a novel method for initial seed selection in k-means-type algorithms. ROBIN imposes constraints on the chosen seeds that lead to better clustering when k-means converges: the constraints make seed selection insensitive to outliers in the data and help it handle variable-density (multi-scale) clusters. The constraints also make the method deterministic, so a single run suffices to obtain good initial seeds, unlike traditional random seeding approaches, which need many runs to find seeds that yield a satisfactory clustering. We comprehensively evaluate ROBIN against state-of-the-art seeding methods on a wide range of synthetic and real datasets, and show that ROBIN consistently outperforms existing approaches in clustering quality.
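The idea described above can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm: it assumes a Local Outlier Factor (LOF)-style density score and a greedy rule that repeatedly picks the point farthest from the seeds chosen so far, skipping candidates whose LOF deviates from 1 (i.e., likely outliers or boundary points). The helper names, the LOF tolerance, and the choice of starting reference point are all illustrative assumptions.

```python
import numpy as np

def lof_scores(X, k=10):
    """Plain-NumPy Local Outlier Factor: ~1 for points inside a cluster,
    much larger for outliers (standard LOF definition, O(n^2) distances)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.argsort(D, axis=1)[:, 1:k + 1]          # k nearest neighbors (self excluded)
    rows = np.arange(n)[:, None]
    k_dist = D[rows, knn][:, -1]                     # distance to the k-th neighbor
    # reachability distance of p w.r.t. neighbor o: max(k_dist(o), d(p, o))
    reach = np.maximum(k_dist[knn], D[rows, knn])
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)         # local reachability density
    return lrd[knn].mean(axis=1) / lrd               # LOF

def density_aware_seeds(X, n_clusters, k=10, lof_tol=0.1):
    """Deterministic, outlier-insensitive seeding in the spirit of ROBIN
    (illustrative sketch). Greedily take the point farthest from the current
    seeds whose LOF is close to 1, so seeds land inside dense regions."""
    lof = lof_scores(X, k)
    ok = np.abs(lof - 1.0) <= lof_tol                # "well inside a cluster"
    seeds = []
    while len(seeds) < n_clusters:
        if seeds:
            d = np.linalg.norm(X[:, None, :] - X[seeds][None, :, :], axis=-1).min(axis=1)
        else:
            d = np.linalg.norm(X - X.mean(axis=0), axis=1)  # start far from the data mean (assumption)
        for i in np.argsort(-d):                     # farthest first
            if ok[i] and i not in seeds:
                seeds.append(int(i))
                break
        else:
            seeds.append(int(np.argsort(-d)[0]))     # fallback if no non-outlier remains
    return np.array(seeds)
```

Because the scan is deterministic (no random draws), repeated runs return the same seeds, which is the property the paper contrasts with random restarts; the LOF filter is what keeps a distant outlier, which a pure farthest-point rule would always pick, from becoming a seed.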
