LOG-Means

Clustering is a fundamental primitive in many applications. To achieve valuable results, the parameters of the clustering algorithm, e.g., the number of clusters, have to be set appropriately, which is a major pitfall. To this end, analysts rely on their domain knowledge to define parameter search spaces. While experienced analysts may be able to define a small search space, novice analysts in particular often define rather large search spaces because they lack in-depth domain knowledge. Estimation methods for the number of clusters can explore these search spaces in different ways. In the worst case, an estimation method performs an exhaustive search of the given search space, which leads to infeasible runtimes for large datasets and large search spaces. We propose LOG-Means, which overcomes these issues of existing methods. We show that LOG-Means provides estimates in time sublinear in the size of the defined search space, making it a strong fit for large datasets and large search spaces. In our comprehensive evaluation on an Apache Spark cluster, we compare LOG-Means to 13 existing estimation methods. The evaluation shows that LOG-Means significantly outperforms these methods in terms of runtime and accuracy. To the best of our knowledge, this is the most systematic comparison on large datasets and search spaces to date.
