Efficient Group Communication for Large-Scale Parallel Clustering

Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploit the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in iterative parallel data mining algorithms. In particular, the analysis focuses on one of the most influential and popular data mining methods, the k-means algorithm for cluster analysis. The straightforward parallel formulation of the k-means algorithm requires a global reduction operation at each iteration step, which hinders its scalability. This work studies a different parallel formulation of the algorithm where the requirement of global communication can be relaxed while still providing the exact solution of the centralised k-means algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real world distributed applications or can be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error which allows a further reduction of the communication costs.

[1]  Anil K. Jain Data clustering: 50 years beyond K-means , 2010, Pattern Recognit. Lett..

[2]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[3]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[4]  Srinivas Aluru,et al.  Parallel construction of multidimensional binary search trees , 2000, ICS '96.

[5]  Mark Baker,et al.  MPJ Express: Towards Thread Safe Java HPC , 2006, 2006 IEEE International Conference on Cluster Computing.

[6]  Ian Glendinning,et al.  Parallel and Distributed Processing , 2001, Digital Image Analysis.

[7]  Domenico Talia,et al.  Scalable Parallel Clustering for Data Mining on Multicomputers , 2000, IPDPS Workshops.

[8]  David Pettinger,et al.  Scalability of efficient parallel K-Means , 2009, 2009 5th IEEE International Conference on E-Science Workshops.

[9]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[10]  Inderjit S. Dhillon,et al.  A Data-Clustering Algorithm on Distributed Memory Multiprocessors , 1999, Large-Scale Parallel Data Mining.

[11]  Pierre Hansen,et al.  NP-hardness of Euclidean sum-of-squares clustering , 2008, Machine Learning.

[12]  Mohammed J. Zaki,et al.  Large-Scale Parallel Data Mining , 2002, Lecture Notes in Computer Science.

[13]  Giuseppe Di Fatta,et al.  Space Partitioning for Scalable K-Means , 2010, 2010 Ninth International Conference on Machine Learning and Applications.

[14]  Giuseppe Di Fatta,et al.  Dynamic Load Balancing in Parallel KD-Tree k-Means , 2010, 2010 10th IEEE International Conference on Computer and Information Technology.

[15]  Anil K. Jain,et al.  Large-Scale Parallel Data Clustering , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  D.M. Mount,et al.  An Efficient k-Means Clustering Algorithm: Analysis and Implementation , 2002, IEEE Trans. Pattern Anal. Mach. Intell..