Non-uniform data distribution for communication-efficient parallel clustering

Abstract

Global communication requirements and load imbalance are major obstacles to exploiting the computational power of large-scale systems in some parallel data mining algorithms. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost of parallel data mining algorithms, in particular the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load-balancing characteristics, which may suggest that it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operation, which hinders the scalability of the approach. This work studies a different parallel formulation of the algorithm in which the requirement of global communication is removed while the deterministic nature of the centralised algorithm is maintained. The proposed approach exploits a non-uniform data distribution, which can either be found in real-world distributed applications or be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested on a parallel computing system with 64 processors and in simulations with 1024 processing elements.
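To make the communication pattern concrete, the following is a minimal sketch of one iteration of the straightforward parallel formulation described in the abstract, that is, the baseline with a global reduction and not the approach proposed in this work. It runs in a single process: each list in `partitions` stands for the data held by one node, and `all_reduce` is a hypothetical stand-in for a collective operation such as an MPI all-reduce. All function names are illustrative assumptions and are not taken from the paper.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def local_stats(partition, centroids):
    """Per-cluster coordinate sums and counts for one node's partition."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in partition:
        j = nearest(point, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def all_reduce(stats):
    """Stand-in for the global reduction: element-wise sum over all nodes."""
    k, dim = len(stats[0][1]), len(stats[0][0][0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in stats:
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    return total_sums, total_counts


def kmeans_step(partitions, centroids):
    """One iteration: local assignment, global reduction, centroid update."""
    sums, counts = all_reduce([local_stats(p, centroids) for p in partitions])
    # An empty cluster keeps its previous centroid (an assumption of this sketch).
    return [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else centroids[j]
        for j in range(len(centroids))
    ]


# Two simulated nodes, two points each, and two initial centroids.
partitions = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (12.0, 10.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_step(partitions, centroids)
```

Every node must contribute to `all_reduce` at every iteration, which is the global communication step identified above as the scalability bottleneck. In this toy input each node's points fall into a single cluster, which loosely illustrates the kind of non-uniform distribution under which most of the exchanged per-cluster statistics would be zero.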
