Scaling Optimizations for Large-Scale Distributed Data with Lightweight Coresets

Lightweight coresets are compact representations of a data set on which clustering methods produce results competitive with those obtained on the complete data set. They are constructed by sampling important points from the complete set. We propose a fast method to approximate the sampling of lightweight coresets from very large data sets that are distributed across multiple machines. We show that the proposed method is considerably faster and more scalable, obtaining results 48 times faster than the original lightweight coreset construction while retaining similar properties.
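To make "sampling important points" concrete, the following is a minimal single-machine sketch of the standard lightweight coreset construction, which mixes a uniform distribution with one proportional to the squared distance from the data mean. It does not implement the distributed approximation proposed here; the function name `lightweight_coreset` and its parameters are illustrative assumptions, not names taken from the paper.

```python
import random


def lightweight_coreset(points, m, seed=0):
    """Sample m weighted points with the lightweight coreset distribution.

    Illustrative sketch only: `points` is a list of equal-length tuples,
    `m` is the coreset size, and `seed` fixes the random draws.
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])

    # Pass 1: mean of the full data set.
    mean = [sum(p[j] for p in points) / n for j in range(dim)]

    # Pass 2: squared Euclidean distance of every point to the mean.
    sq_dist = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(sq_dist)

    # q(x) = 1/2 * 1/n + 1/2 * d(x, mean)^2 / sum_x' d(x', mean)^2
    if total > 0:
        q = [0.5 / n + 0.5 * d / total for d in sq_dist]
    else:
        # All points coincide, so fall back to uniform sampling.
        q = [1.0 / n] * n

    # Draw m indices with replacement according to q.
    idx = rng.choices(range(n), weights=q, k=m)
    coreset = [points[i] for i in idx]

    # Importance weights 1 / (m * q(x)) keep weighted cost estimates unbiased.
    weights = [1.0 / (m * q[i]) for i in idx]
    return coreset, weights, q


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
coreset, weights, q = lightweight_coreset(points, m=3)
```

Both quantities the distribution depends on, the mean and the total squared distance to it, are sums over the data, so in principle each machine can contribute partial sums. How the proposed method approximates this step is not specified in the abstract, so the sketch makes no assumption about it.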
