Initializing k-Means Efficiently: Benefits for Exploratory Cluster Analysis

Data analysis is a highly exploratory task in which various algorithms are executed with different parameters until a satisfactory result is achieved. This is especially evident in cluster analysis, where the number of clusters must be provided before the clustering algorithm is executed. Since this number is rarely known in advance, the algorithm is typically executed several times with varying parameters. Hence, the duration of the exploratory analysis depends heavily on the runtime of each execution of the clustering algorithm. While previous work shows that the initialization of clustering algorithms is crucial for fast and solid results, it focuses solely on a single execution of the clustering algorithm and thereby neglects previous executions. We propose Delta Initialization, an initialization strategy for k-Means in such an exploratory setting. The core idea of this new algorithm is to exploit the clustering results of previous executions to improve the initialization of subsequent executions. We show that this algorithm is well suited for exploratory cluster analysis: it achieves considerable speedups while also yielding superior clustering results compared to state-of-the-art initialization strategies.
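To make the core idea concrete, the following is a minimal sketch of reusing a previous clustering result to seed the next execution. It is not the Delta Initialization algorithm itself, whose details are not given here. It assumes one plausible reading: the centroids found for a smaller number of clusters are kept, and the missing centroids are drawn with k-means++-style D² sampling. The function names `warm_start_centers` and `lloyd`, the parameter names, and the toy data are hypothetical and chosen for illustration only.

```python
import numpy as np


def sq_dist_to_nearest(X, centers):
    """Squared Euclidean distance from each point to its nearest center."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1)


def warm_start_centers(X, prev_centers, k, rng):
    """Hypothetical warm start: keep the previous centroids and add the
    missing k - len(prev_centers) centroids by D^2 sampling (assumption)."""
    centers = np.asarray(prev_centers, dtype=float)
    if k < len(centers):
        raise ValueError("this sketch only handles a growing number of clusters")
    while len(centers) < k:
        d2 = sq_dist_to_nearest(X, centers)
        total = d2.sum()
        if total == 0.0:
            # Every point coincides with a center; fall back to uniform sampling.
            idx = rng.integers(len(X))
        else:
            # Points far from all current centers are more likely to be chosen.
            idx = rng.choice(len(X), p=d2 / total)
        centers = np.vstack([centers, X[idx]])
    return centers


def lloyd(X, init_centers, max_iter=100):
    """Plain Lloyd iterations; returns centers, labels, and iterations used."""
    centers = np.array(init_centers, dtype=float)
    for it in range(1, max_iter + 1):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            return new_centers, labels, it
        centers = new_centers
    return centers, labels, max_iter


# Exploratory loop on toy data: each execution seeds the next one.
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
                  for loc in ((0, 0), (4, 4), (0, 4), (4, 0))])
centers = data[rng.integers(len(data))][None, :]  # start from k = 1
for k in range(2, 6):
    init = warm_start_centers(data, centers, k, rng)
    centers, labels, iterations = lloyd(data, init)
```

In this sketch only the newly added centroids are sampled, so most of each initialization is inherited from the previous execution. Whether this reduces the number of Lloyd iterations on a given dataset has to be measured, for example by comparing `iterations` against a run initialized from scratch.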
