Paper information - Streaming k-means on well-clusterable data

Streaming k-means on well-clusterable data

One of the central problems in data analysis is k-means clustering. In recent years, the streaming variant of this problem has received considerable attention in the literature, culminating in a series of results (Har-Peled and Mazumdar; Frahling and Sohler; Frahling, Monemizadeh, and Sohler; Chen) that produced a (1 + ε)-approximation for k-means clustering in the streaming setting. Unfortunately, since optimizing the k-means objective is Max-SNP hard, all algorithms that achieve a (1 + ε)-approximation must take time exponential in k unless P = NP. Thus, to avoid exponential dependence on k, some additional assumptions must be made to guarantee a high-quality approximation and polynomial running time. A recent paper of Ostrovsky, Rabani, Schulman, and Swamy (FOCS 2006) introduced the very natural assumption of data separability: the assumption closely reflects how k-means is used in practice, and it allowed the authors to create a high-quality approximation for k-means clustering in the non-streaming setting with polynomial running time even for large values of k. Their work left open a natural and important question: are similar results possible in a streaming setting? This is the question we answer in this paper, albeit using substantially different techniques. We show a near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space with sublinear memory and a single pass, under the same data separability assumption. Our algorithm offers significant improvements in both space and running time over previous work while yielding asymptotically best-possible performance (assuming that the running time must be fully polynomial and P ≠ NP).
The novel techniques we develop along the way imply a number of additional results: we provide a high-probability performance guarantee for online facility location (in contrast, Meyerson's FOCS 2001 algorithm gave bounds only in expectation); we develop a constant approximation method for the general class of semi-metric clustering problems; we improve the space requirements for streaming constant-approximation for k-median by a logarithmic factor (even without σ-separability); finally, we design a "re-sampling method" in a streaming setting to convert any constant approximation for clustering to a [1 + O(σ²)]-approximation for σ-separable data.
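The abstract contrasts its high-probability guarantee with Meyerson's online facility location algorithm, whose bounds hold only in expectation. As background, a minimal Python sketch of that classic randomized rule might look as follows. It is not the algorithm of this paper. The function name `online_facility_location`, the parameter `facility_cost`, and the use of plain Euclidean distance are assumptions made for illustration; a k-means variant would presumably use squared distances instead.

```python
import math
import random


def online_facility_location(points, facility_cost, rng=None):
    """Sketch of a Meyerson-style online facility location rule.

    Assumptions (not from the paper): points are coordinate tuples,
    distances are Euclidean, and every facility has the same opening cost.
    """
    rng = rng or random.Random(0)
    facilities = []      # locations of the facilities opened so far
    service_cost = 0.0   # total distance paid by assigned points

    for p in points:
        if not facilities:
            # The first point has no facility to connect to, so it opens one.
            facilities.append(p)
            continue

        # Distance from the new point to its nearest open facility.
        d = min(math.dist(p, f) for f in facilities)

        # Open a facility at p with probability min(d / facility_cost, 1);
        # otherwise assign p to the nearest facility and pay distance d.
        if rng.random() < min(d / facility_cost, 1.0):
            facilities.append(p)
        else:
            service_cost += d

    total_cost = facility_cost * len(facilities) + service_cost
    return facilities, total_cost
```

For example, `online_facility_location([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)], facility_cost=1.0)` always opens facilities at the first point and at `(5.0, 5.0)`, since the latter lies farther than `facility_cost` from every open facility, while `(0.1, 0.0)` opens one only with probability 0.1.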

[1] Dan Feldman, et al. A PTAS for k-means clustering based on weak coresets, 2007, SCG '07.

[2] Sergei Vassilvitskii, et al. k-means++: the advantages of careful seeding, 2007, SODA '07.

[3] Sanjay Ranka, et al. An Efficient Space-Partitioning Based Algorithm for the K-Means Clustering, 1999, PAKDD.

[4] Anthony G. Oettinger, et al. IEEE Transactions on Information Theory, 1998.

[5] Christian Sohler, et al. Coresets in dynamic geometric data streams, 2005, STOC '05.

[6] Robert M. Gray, et al. An Algorithm for Vector Quantizer Design, 1980, IEEE Trans. Commun.

[7] Sanjay Ranka, et al. An efficient k-means clustering algorithm, 1997.

[8] Kamesh Munagala, et al. Local search heuristic for k-median and facility location problems, 2001, STOC '01.

[9] L. F. Tóth. Sur la représentation d'une population infinie par un nombre fini d'éléments, 1959.

[10] Geoffrey H. Ball, et al. ISODATA, a novel method of data analysis and pattern classification, 1965.

[11] S. Hodge, et al. Statistics and Probability, 1972.

[12] Allen Gersho, et al. Vector quantization and signal compression, 1991, The Kluwer international series in engineering and computer science.

[13] E. Forgy, et al. Cluster analysis of multivariate data: efficiency versus interpretability of classifications, 1965.

[14] Andrew W. Moore, et al. Accelerating exact k-means algorithms with geometric reasoning, 1999, KDD '99.

[15] David L. Neuhoff, et al. Quantization, 1998, IEEE Trans. Inf. Theory.

[16] P. Zador. Development and evaluation of procedures for quantizing multivariate distributions, 1963.

[17] Sergei Vassilvitskii, et al. How slow is the k-means method?, 2006, SCG '06.

[18] Joel Max, et al. Quantizing for minimum distortion, 1960, IRE Trans. Inf. Theory.

[19] Steven J. Phillips. Acceleration of K-Means and Related Clustering Algorithms, 2002, ALENEX.

[20] R. Jancey. Multidimensional group analysis, 1966.

[21] R. Tryon. Cluster Analysis, 1939.

[22] S. P. Lloyd, et al. Least squares quantization in PCM, 1982, IEEE Trans. Inf. Theory.

[23] David M. Mount, et al. A local search approximation algorithm for k-means clustering, 2002, SCG '02.

[24] Jeffrey Scott Vitter, et al. Random sampling with a reservoir, 1985, TOMS.

[25] Maria-Florina Balcan, et al. Approximate clustering without the approximation, 2009, SODA.

[26] Rina Panigrahy, et al. Better streaming algorithms for clustering problems, 2003, STOC '03.

[27] J. MacQueen. Some methods for classification and analysis of multivariate observations, 1967.

[28] P. Sopp. Cluster analysis, 1996, Veterinary immunology and immunopathology.

[29] Piotr Indyk, et al. Approximate clustering via core-sets, 2002, STOC '02.

[30] Sudipto Guha, et al. Clustering Data Streams: Theory and Practice, 2003, IEEE Trans. Knowl. Data Eng.

[31] Walter D. Fisher. On Grouping for Maximum Homogeneity, 1958.

[32] Sudipto Guha, et al. Clustering Data Streams, 2000, FOCS.

[33] Geoffrey H. Ball, et al. Data analysis in the social sciences: what about the details?, 1965, AFIPS '65 (Fall, part I).

[34] Anil K. Jain, et al. Data clustering: a review, 1999, CSUR.

[35] Adam Meyerson. Online facility location, 2001, FOCS.

[36] Ke Chen, et al. On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications, 2009, SIAM J. Comput.

[37] Nir Ailon, et al. Streaming k-means approximation, 2009, NIPS.

[38] Sariel Har-Peled, et al. On coresets for k-means and k-median clustering, 2004, STOC '04.