Streaming k-means on well-clusterable data

One of the central problems in data analysis is k-means clustering. In recent years, the streaming variant of this problem has received considerable attention, culminating in a series of results (Har-Peled and Mazumdar; Frahling and Sohler; Frahling, Monemizadeh, and Sohler; Chen) that produced a (1 + ε)-approximation for k-means clustering in the streaming setting. Unfortunately, since optimizing the k-means objective is MAX-SNP-hard, all algorithms that achieve a (1 + ε)-approximation must take time exponential in k unless P = NP. Thus, to avoid exponential dependence on k, some additional assumptions must be made to guarantee a high-quality approximation and polynomial running time. A recent paper of Ostrovsky, Rabani, Schulman, and Swamy (FOCS 2006) introduced the very natural assumption of data separability: the assumption closely reflects how k-means is used in practice, and it allowed the authors to create a high-quality approximation for k-means clustering in the non-streaming setting with polynomial running time even for large values of k. Their work left open a natural and important question: are similar results possible in a streaming setting? This is the question we answer in this paper, albeit using substantially different techniques. We show a near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space with sublinear memory and a single pass, under the same data separability assumption. Our algorithm offers significant improvements in both space and running time over previous work while yielding asymptotically best-possible performance (assuming that the running time must be fully polynomial and P ≠ NP).
The novel techniques we develop along the way imply a number of additional results: we provide a high-probability performance guarantee for online facility location (in contrast, Meyerson's FOCS 2001 algorithm gave bounds only in expectation); we develop a constant-approximation method for the general class of semi-metric clustering problems; we improve, by a logarithmic factor and even without σ-separability, the space requirements for streaming constant-approximation of k-median; finally, we design a "re-sampling method" in a streaming setting that converts any constant approximation for clustering into a [1 + O(σ²)]-approximation for σ-separable data.
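
For orientation, the classic online facility location rule that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the well-known randomized rule (open a facility at an arriving point with probability proportional to its distance from the nearest open facility), not the algorithm of this paper. The function name, the uniform `facility_cost` parameter, and the use of plain Euclidean distance are assumptions made for the example; a k-means variant would presumably use squared distances.

```python
import math
import random


def online_facility_location(points, facility_cost, seed=0):
    """Single-pass randomized facility location (illustrative sketch).

    points: iterable of coordinate tuples, seen once in arrival order.
    facility_cost: hypothetical uniform opening cost f > 0.
    Returns (facilities, total_cost).
    """
    rng = random.Random(seed)
    facilities = []
    service_cost = 0.0
    for p in points:
        if not facilities:
            # The first point always opens a facility.
            facilities.append(p)
            continue
        # Distance from the new point to its nearest open facility.
        d = min(math.dist(p, c) for c in facilities)
        # Open a new facility at p with probability min(d / f, 1);
        # otherwise assign p to the nearest facility and pay d.
        if rng.random() < min(d / facility_cost, 1.0):
            facilities.append(p)
        else:
            service_cost += d
    total_cost = service_cost + facility_cost * len(facilities)
    return facilities, total_cost
```

Each decision is irrevocable and uses only the current set of open facilities, which is what makes the rule usable in one pass. This baseline bounds the cost only in expectation, which is the limitation the abstract refers to.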

[1]  Dan Feldman,et al.  A PTAS for k-means clustering based on weak coresets , 2007, SCG '07.

[2]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[3]  Sanjay Ranka,et al.  An Efficient Space-Partitioning Based Algorithm for the K-Means Clustering , 1999, PAKDD.

[4]  Anthony G. Oettinger,et al.  IEEE Transactions on Information Theory , 1998 .

[5]  Christian Sohler,et al.  Coresets in dynamic geometric data streams , 2005, STOC '05.

[6]  Robert M. Gray,et al.  An Algorithm for Vector Quantizer Design , 1980, IEEE Trans. Commun..

[7]  Sanjay Ranka,et al.  An efficient k-means clustering algorithm , 1997 .

[8]  Kamesh Munagala,et al.  Local search heuristic for k-median and facility location problems , 2001, STOC '01.

[9]  L. F. Tóth Sur la représentation d'une population infinie par un nombre fini d'éléments , 1959 .

[10]  Geoffrey H. Ball,et al.  ISODATA, A NOVEL METHOD OF DATA ANALYSIS AND PATTERN CLASSIFICATION , 1965 .

[11]  S. Hodge,et al.  Statistics and Probability , 1972 .

[12]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[13]  E. Forgy,et al.  Cluster analysis of multivariate data : efficiency versus interpretability of classifications , 1965 .

[14]  Andrew W. Moore,et al.  Accelerating exact k-means algorithms with geometric reasoning , 1999, KDD '99.

[15]  David L. Neuhoff,et al.  Quantization , 1998, IEEE Trans. Inf. Theory.

[16]  P. Zador DEVELOPMENT AND EVALUATION OF PROCEDURES FOR QUANTIZING MULTIVARIATE DISTRIBUTIONS , 1963 .

[17]  Sergei Vassilvitskii,et al.  How slow is the k-means method? , 2006, SCG '06.

[18]  Joel Max,et al.  Quantizing for minimum distortion , 1960, IRE Trans. Inf. Theory.

[19]  Steven J. Phillips Acceleration of K-Means and Related Clustering Algorithms , 2002, ALENEX.

[20]  R. Jancey Multidimensional group analysis , 1966 .

[21]  R. Tryon Cluster Analysis , 1939 .

[22]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[23]  David M. Mount,et al.  A local search approximation algorithm for k-means clustering , 2002, SCG '02.

[24]  Jeffrey Scott Vitter,et al.  Random sampling with a reservoir , 1985, TOMS.

[25]  Maria-Florina Balcan,et al.  Approximate clustering without the approximation , 2009, SODA.

[26]  Rina Panigrahy,et al.  Better streaming algorithms for clustering problems , 2003, STOC '03.

[27]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[28]  P. Sopp Cluster analysis. , 1996, Veterinary immunology and immunopathology.

[29]  Piotr Indyk,et al.  Approximate clustering via core-sets , 2002, STOC '02.

[30]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[31]  Walter D. Fisher On Grouping for Maximum Homogeneity , 1958 .

[32]  Sudipto Guha,et al.  Clustering Data Streams , 2000, FOCS.

[33]  Geoffrey H. Ball,et al.  Data analysis in the social sciences: what about the details? , 1965, AFIPS '65 (Fall, part I).

[34]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[35]  Adam Meyerson,et al.  Online facility location , 2001, FOCS.

[36]  Ke Chen,et al.  On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications , 2009, SIAM J. Comput..

[37]  Nir Ailon,et al.  Streaming k-means approximation , 2009, NIPS.

[38]  Sariel Har-Peled,et al.  On coresets for k-means and k-median clustering , 2004, STOC '04.