Streaming k-Means Clustering with Fast Queries

We present methods for k-means clustering on a stream with a focus on providing fast responses to clustering queries. Compared to the current state-of-the-art, our methods provide substantial improvement in the query time for cluster centers while retaining the desirable properties of provably small approximation error and low space usage. Our algorithms rely on a novel idea of "coreset caching" that systematically reuses coresets (summaries of data) computed for recent queries in answering the current clustering query. We present both theoretical analysis and detailed experiments demonstrating their correctness and e ciency.

[1]  Yu Zhang,et al.  Streaming Algorithms for k-Means Clustering with Fast Queries , 2017, ArXiv.

[2]  Jon Louis Bentley,et al.  Decomposable Searching Problems I: Static-to-Dynamic Transformation , 1980, J. Algorithms.

[3]  David M. Mount,et al.  A local search approximation algorithm for k-means clustering , 2002, SCG '02.

[4]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[5]  Christian Sohler,et al.  StreamKM++: A clustering algorithm for data streams , 2010, JEAL.

[6]  Philip S. Yu,et al.  A Framework for Clustering Evolving Data Streams , 2003, VLDB.

[7]  Nir Ailon,et al.  Streaming k-means approximation , 2009, NIPS.

[8]  Dan Feldman,et al.  Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering , 2013, SODA.

[9]  Michael Langberg,et al.  A unified framework for approximating and clustering data , 2011, STOC.

[10]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[11]  Sariel Har-Peled,et al.  On coresets for k-means and k-median clustering , 2004, STOC '04.

[12]  Adam Meyerson,et al.  Fast and Accurate k-means For Large Datasets , 2011, NIPS.

[13]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[14]  Ameet Talwalkar,et al.  MLlib: Machine Learning in Apache Spark , 2015, J. Mach. Learn. Res..

[15]  Jean Paul Barddal,et al.  SNCStream+: Extending a high quality true anytime data stream clustering algorithm , 2016, Inf. Syst..

[16]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[17]  Geoff Holmes,et al.  MOA: Massive Online Analysis , 2010, J. Mach. Learn. Res..

[18]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[19]  A. Asuncion,et al.  UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences , 2007 .

[20]  Sariel Har-Peled,et al.  Smaller Coresets for k-Median and k-Means Clustering , 2005, SCG.