Paper Information - Dynamic Load Balancing in Parallel KD-Tree k-Means

Dynamic Load Balancing in Parallel KD-Tree k-Means

The k-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Techniques for improving its efficiency have been explored in two main directions. First, the amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques reduce the number of distance computations the algorithm performs at each iteration. The second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance, an issue that has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested: two are based on a static partitioning of the data set, and the third incorporates a dynamic load balancing policy.
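To make the load imbalance concern concrete, the following is a minimal sketch, not the authors' implementation. It assumes the work has already been split into tasks (for example, KD-Tree subtrees) whose costs differ because pruning removes more distance computations in some regions than in others. The cost values and the function names `static_makespan` and `dynamic_makespan` are hypothetical, and the dynamic policy shown is simple greedy list scheduling, which may differ from the policy used in the paper.

```python
import heapq


def static_makespan(costs, workers):
    """Split tasks into contiguous, equal-sized blocks, one per worker.

    Returns the load of the busiest worker, which determines the
    duration of the iteration.
    """
    block = -(-len(costs) // workers)  # ceiling division
    loads = [sum(costs[i:i + block]) for i in range(0, len(costs), block)]
    return max(loads)


def dynamic_makespan(costs, workers):
    """Hand each task, in order, to the currently least-loaded worker.

    This models idle workers pulling work from a shared queue
    (greedy list scheduling), an assumed stand-in for a dynamic policy.
    """
    loads = [0] * workers
    heapq.heapify(loads)
    for cost in costs:
        least = heapq.heappop(loads)
        heapq.heappush(loads, least + cost)
    return max(loads)


# Hypothetical per-task costs: a few expensive regions, many cheap ones.
costs = [9, 8, 7, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(static_makespan(costs, 3))   # busiest worker under equal-sized blocks
print(dynamic_makespan(costs, 3))  # busiest worker under greedy assignment
```

With these assumed costs, the equal-sized static split places all three expensive tasks on one worker, while the greedy assignment spreads the total load evenly. A real distributed memory implementation would also have to account for the communication cost of moving work between nodes, which this sketch ignores.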

Giuseppe Di Fatta | David Pettinger
