Dynamic Load Balancing in Parallel KD-Tree k-Means

One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been explored mainly in two directions. First, the amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques reduce the number of distance computations the algorithm performs at each iteration. The second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested: two are based on a static partitioning of the data set, and the third incorporates a dynamic load balancing policy.
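To make the data-parallel direction concrete, the following is a minimal sketch of one k-Means iteration over a statically partitioned data set. It is an illustration under stated assumptions, not the formulation developed in this work: it uses brute-force nearest-centroid search rather than a KD-Tree, it runs the blocks in a plain loop where a distributed implementation would run one block per node and combine results with a reduction, and all function names (`nearest`, `partial_stats`, `combine`, `kmeans_step`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for j, c in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(point, c))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def partial_stats(block: Sequence[Point], centroids: Sequence[Point]):
    """Work done by one (hypothetical) node: per-centroid coordinate sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in block:
        j = nearest(p, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def combine(partials, centroids: Sequence[Point]) -> List[Point]:
    """Reduction step: merge per-block statistics into new centroids."""
    dim = len(centroids[0])
    new_centroids = []
    for j, old in enumerate(centroids):
        n = sum(counts[j] for _, counts in partials)
        if n == 0:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(tuple(old))
        else:
            new_centroids.append(
                tuple(sum(sums[j][d] for sums, _ in partials) / n for d in range(dim))
            )
    return new_centroids


def kmeans_step(blocks: Sequence[Sequence[Point]], centroids: Sequence[Point]) -> List[Point]:
    """One iteration; the list comprehension stands in for parallel execution."""
    partials = [partial_stats(block, centroids) for block in blocks]
    return combine(partials, centroids)
```

The result does not depend on how the points are split into blocks, but the time each block takes does. With brute-force search the cost is proportional to block size, so equal-sized blocks are balanced. With KD-Tree pruning the cost per point varies with the data and the current centroids, which is the source of the load imbalance discussed above.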
