A unified framework for approximating and clustering data

Given a set F of n positive functions over a ground set X, we consider the problem of computing x* that minimizes the sum ∑_{f∈F} f(x) over x ∈ X. A typical application is shape fitting, where we wish to approximate a set P of n elements (say, points) by a shape x from a (possibly infinite) family X of shapes. Here, each point p ∈ P corresponds to a function f such that f(x) is the distance from p to x, and we seek a shape x that minimizes the sum of distances from the points of P. In the k-clustering variant, each x ∈ X is a tuple of k shapes, and f(x) is the distance from p to its closest shape in x.

Our main result is a unified framework for constructing coresets and approximate clusterings for such general sets of functions. To achieve our results, we forge a link between the classic, well-defined notion of ε-approximations from the theory of PAC learning and VC dimension, and the relatively new (and less consistently defined) paradigm of coresets, which serve as a kind of "compressed representation" of the input set F. Using traditional techniques, a coreset usually implies an LTAS (linear-time approximation scheme) for the corresponding optimization problem, which can be computed in parallel, via one pass over the data, and using only polylogarithmic space (i.e., in the streaming model). For several function families F for which coresets are known not to exist, or for which the corresponding (approximate) optimization problems are hard, our framework yields bicriteria approximations, or coresets that are large but contained in a low-dimensional space.

We demonstrate our unified framework by applying it to projective clustering problems. We obtain new coreset constructions, with significantly smaller coresets than those that have appeared in the literature in recent years, for problems such as: k-median [Har-Peled and Mazumdar, STOC'04], [Chen, SODA'06], [Langberg and Schulman, SODA'10]; k-line median [Feldman, Fiat and Sharir, FOCS'06], [Deshpande and Varadarajan, STOC'07]; projective clustering [Deshpande et al., SODA'06], [Deshpande and Varadarajan, STOC'07]; linear ℓp regression [Clarkson and Woodruff, STOC'09]; low-rank approximation [Sarlós, FOCS'06]; and subspace approximation [Shyamalkumar and Varadarajan, SODA'07], [Feldman, Monemizadeh, Sohler and Woodruff, SODA'10], [Deshpande, Tulsiani and Vishnoi, SODA'11]. The running times for solving the corresponding optimization problems are also significantly improved. Finally, we show how to generalize the results of our framework to squared distances (as in k-means), distances to the q-th power, and deterministic constructions.
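To make the objective concrete, the following is a minimal Python sketch, not taken from the paper: the names (`cost`, `kmedian_functions`, `coreset_error`) and the uniform sample are purely illustrative. It instantiates the generic objective ∑_{f∈F} f(x) for k-median in the plane, and shows the multiplicative-error guarantee that an ε-coreset is meant to provide.

```python
import math
import random

def cost(F, x):
    """Generic objective: sum_{f in F} f(x) for a candidate solution x."""
    return sum(f(x) for f in F)

def kmedian_functions(points):
    """Each point p yields f_p(x) = distance from p to its closest of the
    k centers in the tuple x (the k-clustering variant of shape fitting)."""
    return [lambda centers, p=p: min(math.dist(p, c) for c in centers)
            for p in points]

def coreset_error(F, S, x):
    """Relative error of a weighted coreset S = [(f, w), ...] at solution x.
    An eps-coreset guarantees this is at most eps for *every* x in X."""
    exact = cost(F, x)
    approx = sum(w * f(x) for f, w in S)
    return abs(approx - exact) / exact

if __name__ == "__main__":
    random.seed(0)
    P = [(random.random(), random.random()) for _ in range(1000)]
    F = kmedian_functions(P)
    x = ((0.25, 0.25), (0.75, 0.75))  # a candidate 2-tuple of centers
    # Naive uniform-sample "coreset" with weights n/|S| (illustration only;
    # the paper's framework chooses the sample far more carefully):
    S = [(f, len(F) / 100) for f in random.sample(F, 100)]
    print(cost(F, x), coreset_error(F, S, x))
```

Uniform sampling is only a stand-in here; the point of a coreset construction is to choose the weighted subset S so that the error bound above holds simultaneously for every candidate solution x ∈ X, not just the one being tested.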

[1] Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. 1971.

[2] David Haussler et al. ε-nets and simplex range queries. SCG '86.

[3] David Haussler et al. ε-nets and simplex range queries. Discrete Comput. Geom., 1987.

[4] Bernard Chazelle et al. A deterministic view of random sampling and its use in geometry. FOCS '88.

[5] Jirí Matousek et al. Approximations and optimal geometric divide-and-conquer. STOC '91.

[6] Leonidas J. Guibas et al. Improved bounds on weak ε-nets for convex sets. Discrete Comput. Geom., 1995.

[7] Jirí Matousek et al. Approximations and optimal geometric divide-and-conquer. J. Comput. Syst. Sci., 1995.

[8] Micha Sharir et al. Davenport–Schinzel sequences and their geometric applications. Handbook of Computational Geometry, 1995.

[9] Michael T. Goodrich et al. Almost optimal set covers in finite VC-dimension. Discrete Comput. Geom., 1995.

[10] Piotr Indyk et al. Sublinear time algorithms for metric space problems. STOC '99.

[11] Yi Li et al. Improved bounds on the sample complexity of learning. SODA '00.

[12] Samir Khuller et al. Algorithms for facility location problems with outliers. SODA '01.

[13] Sanjoy Dasgupta et al. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 2003.

[14] Sariel Har-Peled et al. On coresets for k-means and k-median clustering. STOC '04.

[15] Sariel Har-Peled et al. No, Coreset, No Cry. FSTTCS, 2004.

[16] C. Greg Plaxton et al. Optimal time bounds for approximate clustering. Machine Learning, 2002.

[17] David Eppstein et al. Deterministic sampling and range counting in geometric data streams. SCG '04.

[18] K. Clarkson. Subgradient and sampling algorithms for ℓ1 regression. SODA '05.

[19] Sariel Har-Peled et al. Smaller coresets for k-median and k-means clustering. SCG '05.

[20] Santosh S. Vempala et al. Matrix approximation and projective clustering via volume sampling. SODA '06.

[21] Tamás Sarlós et al. Improved approximation algorithms for large matrices via random projections. FOCS '06.

[22] Amos Fiat et al. Coresets for weighted facilities and their applications. FOCS '06.

[23] Ke Chen et al. On k-median clustering in high dimensions. SODA '06.

[24] S. Muthukrishnan et al. Sampling algorithms for ℓ2 regression and applications. SODA '06.

[25] Sariel Har-Peled et al. Coresets for discrete integration and clustering. FSTTCS, 2006.

[26] Artur Czumaj et al. Sublinear-time approximation algorithms for clustering via random sampling. Random Struct. Algorithms, 2007.

[27] Kasturi R. Varadarajan et al. Efficient subspace approximation algorithms. Discrete Comput. Geom., 2007.

[28] Kasturi R. Varadarajan et al. Sampling-based dimension reduction for subspace approximation. STOC '07.

[29] Dan Feldman et al. A PTAS for k-means clustering based on weak coresets. SCG '07.

[30] Amos Fiat et al. Bi-criteria linear-time approximations for generalized k-mean/median/center. SCG '07.

[31] Anirban Dasgupta et al. Sampling algorithms and coresets for ℓp regression. SODA '08.

[32] Ke Chen et al. A constant factor approximation algorithm for k-median clustering with outliers. SODA '08.

[33] Haim Kaplan et al. Private coresets. STOC '09.

[34] David P. Woodruff et al. Numerical linear algebra in the streaming model. STOC '09.

[35] Petros Drineas et al. CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci., 2009.

[36] Nikhil Srivastava et al. Twice-Ramanujan sparsifiers. STOC '09.

[37] L. Schulman et al. Universal ε-approximators for integrals. SODA '10.

[38] David P. Woodruff et al. Coresets and sketches for high dimensional subspace approximation problems. SODA '10.

[39] Amit Kumar et al. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 2010.

[40] Christos Boutsidis et al. Near-optimal column-based matrix reconstruction. FOCS '11.

[41] Dan Feldman et al. From high definition image to low space optimization. SSVM, 2011.

[42] A. Naor. Sparse quadratic forms and their geometric applications (after Batson, Spielman and Srivastava). 2011. arXiv:1101.4324.

[43] Nisheeth K. Vishnoi et al. Algorithms and hardness for subspace approximation. SODA '11.