Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

We show that the sum of squared Euclidean distances from the n rows of an n × d matrix A to any compact set spanned by k vectors in ℝ^d can be approximated up to a (1 + ε)-factor, for arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that an optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection onto the first O(k/ε²) right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^{O(jk)} for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that depends linearly or even exponentially on d, which makes them useless when d is comparable to n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA, and projective clustering. These algorithms use update time per point and memory that are polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size
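As a quick illustration of the dimension-reduction claim above, the sketch below (a minimal example assuming NumPy and scikit-learn; the synthetic data, the variable names, and the concrete choice m = ⌈k/ε²⌉ are illustrative and not taken from the paper's construction) clusters the rows of A after projecting them onto the first m right singular vectors, then compares the resulting cost plus the constant ‖A − A_m‖²_F with the k-means cost on the original d-dimensional data.

```python
# Minimal sketch (assumed numpy/scikit-learn; not the paper's coreset construction)
# of the claim that the k-means cost of A is (1 + eps)-approximated by the cost of
# its projection onto the first m = O(k / eps^2) right singular vectors, plus the
# constant ||A - A_m||_F^2.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(X, k, seed=0):
    # Sum of squared distances from the rows of X to their nearest of k centers.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_

rng = np.random.default_rng(0)
n, d, k, eps = 2000, 500, 5, 0.5

# Synthetic data: k well-separated Gaussian clusters in d dimensions.
centers = rng.normal(scale=5.0, size=(k, d))
A = np.repeat(centers, n // k, axis=0) + rng.normal(size=(n, d))

# Rank-m sketch: coordinates of the rows of A in the span of the top m right
# singular vectors (the first m principal components).
m = int(np.ceil(k / eps**2))
_, _, Vt = np.linalg.svd(A, full_matrices=False)
A_proj = A @ Vt[:m].T                      # n x m matrix

# The constant from the abstract: the squared Frobenius norm of the part of A
# discarded by the rank-m approximation, ||A||_F^2 - ||A_m||_F^2.
residual = np.sum(A**2) - np.sum(A_proj**2)

cost_full = kmeans_cost(A, k)
cost_sketch = kmeans_cost(A_proj, k) + residual
print(f"k-means cost on A             : {cost_full:.1f}")
print(f"cost on projection + constant : {cost_sketch:.1f}   (m = {m})")
```

The point of the comparison is that the m-dimensional projection, with m independent of d, already determines the k-means cost up to the stated (1 + ε)-factor; this is what allows the coresets and the merge-and-reduce streaming algorithms described above to use memory only linear in d.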
