
Using Pivots to Speed-Up k-Medoids Clustering

Clustering is a key technique within the KDD process, and k-means and the more general k-medoids are well-known incremental partition-based clustering algorithms. A fundamental issue within this class of algorithms is finding an initial set of medians (or medoids) that improves both the efficiency of the algorithms (e.g., by accelerating convergence to a solution) and their effectiveness (e.g., by finding more meaningful clusters). In this article we therefore aim to provide a technique that, given a set of elements, quickly finds a very small number of elements as medoid candidates for this set, making it possible to improve both the efficiency and the effectiveness of existing clustering algorithms. We target the class of k-medoids algorithms in general, and propose a technique that selects a well-positioned subset of central elements to serve as the initial set of medoids for the clustering process. Our technique requires substantially fewer distance calculations than existing methods, thus improving the algorithm's efficiency without sacrificing effectiveness. A salient feature of the proposed technique is that it is not a new k-medoids clustering algorithm per se; rather, it can be used in conjunction with any existing clustering algorithm based on the k-medoids paradigm. Experimental results, using both synthetic and real datasets, confirm the efficiency, effectiveness and scalability of the proposed technique.
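To make the role of the initial medoids concrete, the following is a minimal sketch of a generic alternating k-medoids loop that accepts any initial medoid set and counts distance calculations. It is not the pivot-based selection technique proposed in the paper, whose details the abstract does not give. The function name `k_medoids`, the `calls` counter, and the toy data are assumptions made purely for illustration.

```python
from math import dist

def k_medoids(points, initial_medoids, max_iter=100):
    """Generic alternating k-medoids (illustrative only, not the paper's method).

    Returns the final medoids, the cluster label of each point, and the
    number of distance calculations performed.
    """
    calls = 0

    def d(a, b):
        # Wrap the metric so every distance calculation is counted.
        nonlocal calls
        calls += 1
        return dist(a, b)

    medoids = list(initial_medoids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [min(range(len(medoids)), key=lambda j: d(p, medoids[j]))
                  for p in points]
        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members.
        new_medoids = []
        for j in range(len(medoids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                new_medoids.append(medoids[j])  # keep the medoid of an empty cluster
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(d(c, q) for q in members)))
        if new_medoids == medoids:
            break  # converged: no medoid changed
        medoids = new_medoids
    return medoids, labels, calls

# Hypothetical toy data: two well-separated groups of five 2-D points.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
group_b = [(10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
points = group_a + group_b

# One seed per group versus both seeds in the same group.
good = k_medoids(points, [(0, 0), (10, 10)])
bad = k_medoids(points, [(0, 0), (0, 1)])
print("distance calls with spread-out seeds:", good[2])
print("distance calls with clumped seeds:  ", bad[2])
```

On this toy data the spread-out seeds reach a stable solution with fewer distance calculations than the clumped seeds, which is the kind of saving a well-positioned initial medoid set is meant to provide. Any seeding strategy can be plugged in through `initial_medoids` without changing the clustering loop itself.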

Caetano Traina | Mario A. Nascimento | Adriano Arantes Paterlini
