Distributed PCA and k-Means Clustering

This paper proposes a distributed PCA algorithm with the theoretical guarantee that any good approximate solution for k-means clustering on the projected data is also a good approximate solution on the original data, while the required projected dimension is independent of the original dimension. When combined with the distributed coreset-based clustering approach in [3], this leads to an algorithm in which the number of vectors communicated is independent of both the size and the dimension of the original data. Our experimental results demonstrate the effectiveness of the algorithm.
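The abstract does not spell out the algorithm's steps, so the following is only a minimal sketch of one common way to organize a distributed PCA, not necessarily the paper's exact procedure. It assumes each node sends its top-t singular values and right singular vectors to a coordinator, which stacks them and extracts a global t-dimensional basis. The function names (`local_summary`, `global_basis`, `distributed_pca`) and the parameter `t` are hypothetical, chosen for illustration.

```python
import numpy as np


def local_summary(P_i, t):
    """Summarize one node's data (n_i x d) as t scaled right singular vectors (t x d).

    Assumption: this t x d matrix is all a node communicates, so the message
    size depends on t and d but not on the number of local points n_i.
    """
    _, s, Vt = np.linalg.svd(P_i, full_matrices=False)
    t = min(t, len(s))
    # Scale each right singular vector by its singular value.
    return s[:t, None] * Vt[:t]


def global_basis(summaries, t):
    """Coordinator step: stack the local summaries and take the top-t right singular vectors."""
    Y = np.vstack(summaries)
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return Vt[:t].T  # d x t matrix with orthonormal columns


def distributed_pca(local_data, t):
    """Return the global basis V (d x t) and each node's projected coordinates (n_i x t).

    Assumption: the data is used as given (no mean-centering step).
    """
    summaries = [local_summary(P, t) for P in local_data]
    V = global_basis(summaries, t)
    return V, [P @ V for P in local_data]
```

A k-means solver, or a distributed coreset construction, would then run on the projected coordinates rather than on the original points. In this sketch each node sends t vectors regardless of how many points it holds.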

[1]  N. Samatova, et al. Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets, 2002.

[2]  David M. Mount, et al. A local search approximation algorithm for k-means clustering, 2002, SCG '02.

[3]  Jennifer Widom, et al. Adaptive filters for continuous queries over distributed data streams, 2003, SIGMOD '03.

[4]  Sariel Har-Peled, et al. On coresets for k-means and k-median clustering, 2004, STOC '04.

[5]  Chris H. Q. Ding, et al. K-means clustering via principal component analysis, 2004, ICML.

[6]  Sanjeev Khanna, et al. Power-conserving computation of order-statistics over sensor networks, 2004, PODS.

[7]  Jeffrey Considine, et al. Approximate aggregation techniques for sensor databases, 2004, Proceedings. 20th International Conference on Data Engineering.

[8]  Franklin T. Luk, et al. Principal Component Analysis for Distributed Data Sets with Updating, 2005, APPT.

[9]  Svetha Venkatesh, et al. Distributed query processing for mobile surveillance, 2007, ACM Multimedia.

[10]  Sylvain Raybaud, et al. Distributed Principal Component Analysis for Wireless Sensor Networks, 2008, Sensors.

[11]  Niklas Carlsson, et al. Characterizing web-based video sharing workloads, 2009, WWW '09.

[12]  Amit Kumar, et al. Clustering with Spectral Norm and the k-Means Algorithm, 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[13]  Sergio Valcarcel Macua, et al. Consensus-based distributed principal component analysis in wireless sensor networks, 2010, 2010 IEEE 11th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC).

[14]  Michael Langberg, et al. A unified framework for approximating and clustering data, 2011, STOC.

[15]  P. Cochat, et al. Et al, 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[16]  Christopher Frost, et al. Spanner: Google's Globally-Distributed Database, 2012, OSDI.

[17]  Maria-Florina Balcan, et al. Distributed k-means and k-median clustering on general communication topologies, 2013, NIPS.

[18]  Dan Feldman, et al. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering, 2013, SODA.