Speeding up k-means Clustering by Bootstrap Averaging

K-means clustering is one of the most popular clustering algorithms used in data mining. However, clustering is a time-consuming task, particularly for the large data sets found in data mining. In this paper we show how bootstrap averaging with k-means can produce results comparable to clustering all of the data, but in much less time. The bootstrap (sampling with replacement) averaging approach consists of running k-means clustering to convergence on small bootstrap samples of the training data and averaging similar cluster centroids to obtain a single model. We show why our approach should take less computation time and empirically illustrate its benefits. We show that the performance of our approach is a monotonic function of the bootstrap sample size. However, determining the bootstrap sample size that yields results as good as clustering the entire data set remains an open and important question.
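
The procedure described above can be sketched concretely. The following is a minimal sketch, assuming scikit-learn's KMeans and SciPy are available; the function name bootstrap_averaged_kmeans and its parameters are illustrative, and Hungarian matching against a reference run is used here as one plausible way of deciding which centroids are "similar" enough to average, since the paper's exact matching rule is not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans


def bootstrap_averaged_kmeans(X, k, n_samples, n_bootstraps, random_state=0):
    """Cluster several small bootstrap samples and average matched centroids.

    X            : (n, d) data matrix
    k            : number of clusters
    n_samples    : size of each bootstrap sample (much smaller than n)
    n_bootstraps : number of bootstrap samples to draw
    """
    rng = np.random.default_rng(random_state)
    centroid_sets = []
    for b in range(n_bootstraps):
        # Sample with replacement (the bootstrap) and run k-means to
        # convergence on the small sample only.
        idx = rng.integers(0, X.shape[0], size=n_samples)
        km = KMeans(n_clusters=k, n_init=1, random_state=random_state + b)
        km.fit(X[idx])
        centroid_sets.append(km.cluster_centers_)

    # Treat the first run's centroids as a reference and align every other
    # run's centroids to it with the Hungarian algorithm on pairwise
    # distances, so that similar centroids are averaged together.
    reference = centroid_sets[0]
    aligned = [reference]
    for centers in centroid_sets[1:]:
        cost = np.linalg.norm(reference[:, None, :] - centers[None, :, :], axis=2)
        _, col_ind = linear_sum_assignment(cost)
        aligned.append(centers[col_ind])

    # The single combined model is the element-wise average of the
    # matched centroid sets.
    return np.mean(aligned, axis=0)


if __name__ == "__main__":
    # Toy usage: three well-separated 2-D Gaussian clusters.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=c, size=(500, 2)) for c in (-5.0, 0.0, 5.0)])
    print(bootstrap_averaged_kmeans(X, k=3, n_samples=100, n_bootstraps=10))
```

Because each k-means run touches only n_samples points rather than the full data set, the total work grows with the bootstrap sample size and the number of samples, which is the source of the speed-up the abstract claims.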
