Distributed Clustering on Graphs

This paper provides new algorithms for distributed clustering under two popular center-based objectives, k-median and k-means. These algorithms have provable guarantees and lower communication complexity than existing approaches. Following a classic approach in clustering by [13], we reduce the problem of finding a clustering with low cost to the problem of finding a 'coreset' of small size. We provide a distributed method for constructing a global coreset that reduces the communication complexity relative to previous methods and works over general communication topologies. Experimental results on large-scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms.
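To make the coreset idea concrete, the following is a minimal sketch of what it means for a small weighted set to stand in for the full data when evaluating clustering cost. It is not the construction proposed in this paper: the function names (`kmeans_cost`, `toy_coreset`), the two-node data, and the duplicate-collapsing summary are all assumptions chosen for illustration. Collapsing duplicates gives an exact (zero-error) summary only on this toy data; practical coresets are approximate and are typically built by sampling.

```python
from collections import Counter


def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum of weight * squared distance to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        nearest = min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        total += w * nearest
    return total


def toy_coreset(points):
    """Hypothetical stand-in for a coreset: collapse duplicates into weighted points."""
    counts = Counter(points)
    pts = list(counts)
    return pts, [float(counts[p]) for p in pts]


# Hypothetical local data held by two nodes of the network.
node_a = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
node_b = [(5.0, 5.0), (5.0, 5.0), (5.0, 5.0), (6.0, 5.0)]

# Each node summarizes its own data; the global summary is the union of the
# local summaries, so only the weighted points need to be communicated.
summary_points, summary_weights = [], []
for local in (node_a, node_b):
    pts, wts = toy_coreset(local)
    summary_points += pts
    summary_weights += wts

centers = [(0.0, 0.0), (5.0, 5.0)]
full_cost = kmeans_cost(node_a + node_b, centers)
summary_cost = kmeans_cost(summary_points, centers, summary_weights)
```

Here the 7 original points are replaced by 4 weighted points, and the cost of any candidate set of centers is the same on both. The sketch also shows why summaries compose across nodes: the union of local summaries summarizes the union of the local data sets.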

[1]  Shi Li,et al.  Approximating k-median via pseudo-approximation , 2012, STOC '13.

[2]  A. Asuncion,et al.  UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences , 2007 .

[3]  Wilson C. Hsieh,et al.  Spanner: Google's Globally-Distributed Database , 2012, OSDI '12.

[4]  H. Kriegel,et al.  Towards Effective and Efficient Distributed Clustering , 2003 .

[5]  Sariel Har-Peled,et al.  On coresets for k-means and k-median clustering , 2004, STOC '04.

[6]  Jeffrey Considine,et al.  Approximate aggregation techniques for sensor databases , 2004, Proceedings. 20th International Conference on Data Engineering.

[7]  Sanjeev Khanna,et al.  Power-conserving computation of order-statistics over sensor networks , 2004, PODS.

[8]  Albert-László Barabási,et al.  Statistical mechanics of complex networks , 2001, ArXiv.

[9]  Sergei Vassilvitskii,et al.  Scalable K-Means++ , 2012, Proc. VLDB Endow..

[10]  Qi Zhang,et al.  Approximate Clustering on Distributed Data Streams , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[11]  L. Schulman,et al.  Universal ε-approximators for integrals , 2010, SODA '10.

[12]  Shai Ben-David.  A Framework for Statistical Clustering with a Constant Time Approximation Algorithms for K-Median Clustering , 2004, COLT.

[13]  Sariel Har-Peled,et al.  Smaller Coresets for k-Median and k-Means Clustering , 2005, SCG.

[14]  Jennifer Widom,et al.  Adaptive filters for continuous queries over distributed data streams , 2003, SIGMOD '03.

[15]  Hillol Kargupta,et al.  Distributed Clustering Using Collective Principal Component Analysis , 2001, Knowledge and Information Systems.

[16]  Ke Chen,et al.  On k-Median clustering in high dimensions , 2006, SODA '06.

[17]  Bin Zhang,et al.  Distributed data clustering can be efficient and exact , 2000, SKDD.

[18]  Dimitris K. Tasoulis,et al.  Unsupervised distributed clustering , 2004, Parallel and Distributed Computing and Networks.

[19]  Niklas Carlsson,et al.  Characterizing web-based video sharing workloads , 2009, WWW '09.

[20]  H. Kargupta,et al.  K-Means Clustering over Peer-to-peer Networks , 2005 .

[21]  Dan Feldman,et al.  An effective coreset compression algorithm for large scale sensor networks , 2012, 2012 ACM/IEEE 11th International Conference on Information Processing in Sensor Networks (IPSN).

[22]  Svetha Venkatesh,et al.  Distributed query processing for mobile surveillance , 2007, ACM Multimedia.

[23]  Michael Langberg,et al.  A unified framework for approximating and clustering data , 2011, STOC.

[24]  David M. Mount,et al.  A local search approximation algorithm for k-means clustering , 2002, SCG '02.