Scalable bootstrap clustering for massive data

The bootstrap provides a simple and powerful means of improving the accuracy of clustering. However, for today's increasingly large datasets, computing bootstrap-based quantities can be prohibitively expensive. In this paper we introduce Bag of Little Bootstraps Clustering (BLBC), a new procedure that uses the Bag of Little Bootstraps technique to obtain a robust, computationally efficient means of clustering massive data. Moreover, BLBC is well suited to implementation on the modern parallel and distributed computing architectures that are often used to process large datasets. We empirically investigate the performance characteristics of BLBC and compare it with existing methods in experiments on simulated and real data. The results show that BLBC has a significantly more favorable computational profile than bootstrap-based clustering while maintaining good statistical correctness.
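The abstract does not spell out the BLBC procedure, so the following is only a minimal sketch of how the Bag of Little Bootstraps idea might be combined with a clustering routine. It assumes weighted k-means as the base clusterer, multinomial resampling weights that sum to the full sample size n, and a crude alignment of cluster centers by sorting. The function names, the default values of `gamma`, `s`, and `r`, and the averaging step are all hypothetical choices made for illustration, not the authors' method.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_kmeans(points, weights, k, rng, iters=20):
    """Lloyd's algorithm in which point i counts weights[i] times."""
    dim = len(points[0])
    # Assumption: initial centers are k distinct points drawn at random.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            if w == 0:
                continue  # point was not drawn in this resample
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        for j in range(k):
            if totals[j] > 0:  # an empty cluster keeps its old center
                centers[j] = [x / totals[j] for x in sums[j]]
    return centers


def blb_cluster_centers(data, k, gamma=0.7, s=5, r=10, seed=0):
    """Hypothetical BLB-style estimate of k cluster centers.

    Draws s subsamples of size b = n**gamma, clusters r weighted
    resamples of each, and averages the sorted centers.
    """
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    b = max(k, int(n ** gamma))
    per_subset = []
    for _subset in range(s):
        subsample = rng.sample(data, b)  # b points without replacement
        acc = [[0.0] * dim for _ in range(k)]
        for _resample in range(r):
            # Multinomial counts: n draws over the b subsample points,
            # so only b distinct points are ever stored or clustered.
            counts = [0] * b
            for _draw in range(n):
                counts[rng.randrange(b)] += 1
            # Assumption: sorting centers is enough to align labels.
            centers = sorted(weighted_kmeans(subsample, counts, k, rng))
            for j in range(k):
                for d in range(dim):
                    acc[j][d] += centers[j][d] / r
        per_subset.append(acc)
    # Average the per-subset estimates.
    return [[sum(m[j][d] for m in per_subset) / s for d in range(dim)]
            for j in range(k)]
```

Each of the s subsets is processed independently in this sketch, which is the property that would make such a scheme easy to spread across parallel workers. Sorting centers to align labels is adequate only for well-separated clusters; a real implementation would need a more careful matching step.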
