Distributed k-Means and k-Median Clustering on General Topologies

This paper presents new algorithms for distributed clustering under two popular center-based objectives, k-median and k-means. The algorithms have provable guarantees and lower communication complexity than existing approaches. Following a classic approach to clustering by \cite{har2004coresets}, we reduce the problem of finding a low-cost clustering to the problem of finding a small coreset. We provide a distributed method for constructing a global coreset that reduces communication complexity relative to previous methods and works over general communication topologies. Experimental results on large-scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms.
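To make the coreset reduction concrete, the following is a minimal sketch of the general idea only, not the algorithm of this paper: each node summarizes its local data as a small weighted point set, and the union of these summaries stands in for the full data when evaluating the k-means cost of a candidate set of centers. The local step here is plain uniform sampling with weight n/size per sampled point, which is an assumption chosen for simplicity and carries no approximation guarantee. The names `kmeans_cost`, `local_summary`, and `global_summary` are hypothetical.

```python
import random


def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum of w * squared distance to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        nearest = min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        total += w * nearest
    return total


def local_summary(points, size, rng):
    """Hypothetical local step (assumption): a uniform sample in which each
    sampled point carries weight n / size, so the weights sum to n."""
    if not points:
        return [], []
    size = min(size, len(points))
    sample = rng.sample(points, size)
    return sample, [len(points) / size] * size


def global_summary(nodes, size_per_node, seed=0):
    """Union of the per-node weighted summaries."""
    rng = random.Random(seed)
    pts, wts = [], []
    for local_points in nodes:
        sample, weights = local_summary(local_points, size_per_node, rng)
        pts.extend(sample)
        wts.extend(weights)
    return pts, wts


# Three nodes, each holding 500 two-dimensional points around its own mean.
data_rng = random.Random(1)
means = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
nodes = [
    [(data_rng.gauss(mx, 1.0), data_rng.gauss(my, 1.0)) for _ in range(500)]
    for mx, my in means
]

all_points = [p for node in nodes for p in node]
summary_points, summary_weights = global_summary(nodes, size_per_node=50)

full_cost = kmeans_cost(all_points, means)
approx_cost = kmeans_cost(summary_points, means, summary_weights)
print("full cost:", full_cost)
print("summary cost:", approx_cost)
print("ratio:", approx_cost / full_cost)
```

In this sketch each node contributes 50 weighted points instead of 500 raw points, so the combined summary is one tenth the size of the data. A true coreset would additionally guarantee that the weighted cost is close to the full cost for every choice of centers, which uniform sampling does not.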

[1]  Bin Zhang, et al. Distributed data clustering can be efficient and exact, 2000, SKDD.

[2]  Yi Li, et al. Improved bounds on the sample complexity of learning, 2000, SODA '00.

[3]  Hillol Kargupta, et al. Distributed Clustering Using Collective Principal Component Analysis, 2001, Knowledge and Information Systems.

[4]  Albert-László Barabási, et al. Statistical mechanics of complex networks, 2001, ArXiv.

[5]  David M. Mount, et al. A local search approximation algorithm for k-means clustering, 2002, SCG '02.

[6]  H. Kriegel, et al. Towards Effective and Efficient Distributed Clustering, 2003.

[7]  Jennifer Widom, et al. Adaptive filters for continuous queries over distributed data streams, 2003, SIGMOD '03.

[8]  Sariel Har-Peled, et al. On coresets for k-means and k-median clustering, 2004, STOC '04.

[9]  Dimitris K. Tasoulis, et al. Unsupervised distributed clustering, 2004, Parallel and Distributed Computing and Networks.

[10]  Sanjeev Khanna, et al. Power-conserving computation of order-statistics over sensor networks, 2004, PODS.

[11]  Jeffrey Considine, et al. Approximate aggregation techniques for sensor databases, 2004, Proceedings. 20th International Conference on Data Engineering.

[12]  Shai Ben-David. A Framework for Statistical Clustering with a Constant Time Approximation Algorithms for K-Median Clustering, 2004, COLT.

[13]  Hans-Peter Kriegel, et al. Effective and efficient distributed model-based clustering, 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[14]  H. Kargupta, et al. K-Means Clustering over Peer-to-peer Networks, 2005.

[15]  Ke Chen, et al. On k-Median clustering in high dimensions, 2006, SODA '06.

[16]  Sariel Har-Peled, et al. Smaller Coresets for k-Median and k-Means Clustering, 2007, Discret. Comput. Geom.

[17]  Svetha Venkatesh, et al. Distributed query processing for mobile surveillance, 2007, ACM Multimedia.

[18]  Qi Zhang, et al. Approximate Clustering on Distributed Data Streams, 2008, 2008 IEEE 24th International Conference on Data Engineering.

[19]  Niklas Carlsson, et al. Characterizing web-based video sharing workloads, 2009, WWW '09.

[20]  L. Schulman, et al. Universal ε-approximators for integrals, 2010, SODA '10.

[21]  Michael Langberg, et al. A unified framework for approximating and clustering data, 2011, STOC.

[22]  Sariel Har-Peled. Geometric Approximation Algorithms, 2011.

[23]  Christopher Frost, et al. Spanner: Google's Globally-Distributed Database, 2012, OSDI.

[24]  Avishek Saha, et al. Efficient Protocols for Distributed Classification and Optimization, 2012, ALT.

[25]  Sergei Vassilvitskii, et al. Scalable K-Means++, 2012, Proc. VLDB Endow.

[26]  Maria-Florina Balcan, et al. Distributed Learning, Communication Complexity and Privacy, 2012, COLT.

[27]  Dan Feldman, et al. An effective coreset compression algorithm for large scale sensor networks, 2012, 2012 ACM/IEEE 11th International Conference on Information Processing in Sensor Networks (IPSN).

[28]  Santosh S. Vempala, et al. Nimble Algorithms for Cloud Computing, 2013, ArXiv.

[29]  Shi Li, et al. Approximating k-median via pseudo-approximation, 2012, STOC '13.

[30]  S. Sophia, et al. Real-World Applications of Distributed Clustering Mechanism in Dense Wireless Sensor Networks, 2013.

[31]  Maria-Florina Balcan, et al. Center Based Clustering: A Foundational Perspective, 2014.

[32]  Santosh S. Vempala, et al. Principal Component Analysis and Higher Correlations for Distributed Data, 2013, COLT.