Clustering refinement

Advanced validation of cluster analysis is expected to increase confidence in its results and to allow reliable implementations. In this work, we describe and test CluReAL, an algorithm that refines a clustering irrespective of the method that produced it. We also present ideograms that summarize clustered problem spaces and support their proper interpretation. Both techniques are built on absolute cluster validity indices. Experiments cover a wide variety of scenarios and six of the most popular clustering techniques. Results show the potential of CluReAL for enhancing clustering and the suitability of ideograms for understanding the context of the data through the lens of cluster analysis. Refinement and interpretability are both crucial for reducing failures and for increasing performance control and operational awareness in unsupervised analysis.
