Distance Matrix Pre-Caching and Distributed Computation of Internal Validation Indices in k-medoids Clustering

In this paper we discuss techniques for potential speedups in $k$-medoids clustering. Specifically, we address the advantages of pre-caching the pairwise distance matrix, the heart of the $k$-medoids clustering algorithm, in order to speed up not only the execution of the algorithm itself but also the evaluation of the well-known Silhouette Index and Davies-Bouldin Index for cluster validation. A major disadvantage of such pre-caching is that it might not be suitable for large datasets, since the size of the matrix grows quadratically with the number of patterns. To this end, a further contribution consists in proposing parallel and distributed implementations of both the Simplified Silhouette Index and the Davies-Bouldin Index for distributed $k$-medoids clustering using the Apache Spark framework. Results on real-world pathway map datasets show the robustness of such distributed implementations and underline their effectiveness for structured data.
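To make the pre-caching idea concrete, the following is a minimal single-machine sketch and not the implementation evaluated in the paper. It assumes Euclidean distance as a stand-in for whatever dissimilarity suits the data at hand, and all function names are hypothetical. The pairwise matrix is computed once and then reused both by the $k$-medoids iterations and by a Simplified Silhouette evaluation, in which each pattern is compared against cluster medoids only.

```python
from math import dist  # Euclidean distance, Python 3.8+ (assumed dissimilarity)


def pairwise_distances(points):
    """Compute the symmetric pairwise distance matrix once (the cache)."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = dist(points[i], points[j])
    return D


def assign(D, medoids):
    """Label each pattern with the index of its nearest medoid (lookups only)."""
    return [min(range(len(medoids)), key=lambda c: D[i][medoids[c]])
            for i in range(len(D))]


def update_medoids(D, labels, medoids):
    """Pick, per cluster, the member minimising the sum of cached distances."""
    new_medoids = []
    for c, old in enumerate(medoids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:  # keep the old medoid if a cluster becomes empty
            new_medoids.append(old)
            continue
        new_medoids.append(
            min(members, key=lambda m: sum(D[m][j] for j in members)))
    return new_medoids


def k_medoids(D, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        new_medoids = update_medoids(D, assign(D, medoids), medoids)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(D, medoids)


def simplified_silhouette(D, labels, medoids):
    """Average simplified silhouette: distances to medoids only (needs k >= 2)."""
    scores = []
    for i, own in enumerate(labels):
        a = D[i][medoids[own]]  # distance to own medoid
        b = min(D[i][m] for c, m in enumerate(medoids) if c != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

A possible usage is `D = pairwise_distances(points)`, followed by `medoids, labels = k_medoids(D, [0, 1])` and `simplified_silhouette(D, labels, medoids)`. After the first call no distance is ever recomputed, which is where the speedup comes from; the price is the quadratic memory footprint that motivates a distributed alternative for large datasets.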
