Distance Matrix Pre-Caching and Distributed Computation of Internal Validation Indices in k-medoids Clustering

In this paper we discuss techniques for potential speedups in $k$-medoids clustering. Specifically, we address the advantages of pre-caching the pairwise distance matrix, the heart of the $k$-medoids clustering algorithm, in order to speed up not only the execution of the algorithm itself but also the evaluation of the well-known Silhouette Index and Davies-Bouldin Index for cluster validation. A major disadvantage of such pre-caching is that it might not be suitable for large datasets, since the size of the matrix grows quadratically with the number of patterns. To this end, a further contribution consists in proposing parallel and distributed implementations of both the Simplified Silhouette Index and the Davies-Bouldin Index for distributed $k$-clustering using the Apache Spark framework. Results on real-world pathway map datasets show the robustness of such distributed implementations, also underlining their effectiveness for structured data.
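The pre-caching idea can be illustrated with a minimal single-machine sketch in NumPy. It is not the authors' implementation and does not cover the Apache Spark part. It assumes Euclidean distance, distinct data points, at least two clusters, a plain alternating (assign, then update) $k$-medoids scheme, and medoid-based variants of the Simplified Silhouette and Davies-Bouldin indices. All function names and the optional `init` parameter are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix, computed once and then reused (the pre-cache)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def k_medoids(D, k, init=None, n_iter=100, seed=0):
    """Alternating k-medoids that only reads the cached matrix D (assumed scheme)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = np.array(init) if init is not None else rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assign each point to its nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            # New medoid: the member with the smallest sum of distances to the others
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

def simplified_silhouette(D, medoids, labels):
    """Mean of (b - a) / max(a, b), with a and b measured against medoids only."""
    idx = np.arange(D.shape[0])
    to_medoids = D[:, medoids]            # n x k lookup into the cache
    a = to_medoids[idx, labels]           # distance to own medoid
    others = to_medoids.copy()
    others[idx, labels] = np.inf
    b = others.min(axis=1)                # distance to nearest other medoid
    denom = np.maximum(a, b)
    safe = np.where(denom > 0, denom, 1.0)
    return float(np.where(denom > 0, (b - a) / safe, 0.0).mean())

def davies_bouldin(D, medoids, labels):
    """Medoid-based Davies-Bouldin: scatter and separation are read from D."""
    k = len(medoids)
    scatter = np.array([D[labels == c, medoids[c]].mean() for c in range(k)])
    sep = D[np.ix_(medoids, medoids)]     # fancy indexing returns a copy
    np.fill_diagonal(sep, np.inf)         # exclude i == j from the maximum
    ratio = (scatter[:, None] + scatter[None, :]) / sep
    return float(ratio.max(axis=1).mean())

# Toy data: two small, well-separated groups (illustrative only)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
D = pairwise_distances(X)
medoids, labels = k_medoids(D, k=2, init=[0, 3])
print(simplified_silhouette(D, medoids, labels), davies_bouldin(D, medoids, labels))
```

In this sketch the clustering loop and both indices only index into `D`, so no distance is recomputed after the initial pass. The cost is the $n \times n$ storage, which is the limitation the abstract points out for large datasets.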

Antonello Rizzi | Fabio Massimo Frattale Mascioli | Alessio Martino
