Effects of resampling method and adaptation on clustering ensemble efficacy

Clustering ensembles combine multiple partitions of data into a single clustering solution of better quality. Inspired by the success of supervised bagging and boosting algorithms, we propose non-adaptive and adaptive resampling schemes for the integration of multiple independent and dependent clusterings. We investigate the effectiveness of bagging techniques, comparing the efficacy of sampling with and without replacement, in conjunction with several consensus algorithms. In our adaptive approach, individual partitions in the ensemble are sequentially generated by clustering specially selected subsamples of the given dataset. The sampling probability for each data point depends dynamically on the consistency of its previous assignments in the ensemble. New subsamples are then drawn to focus increasingly on the problematic regions of the input feature space. A measure of clustering consistency for each data point is therefore defined to guide this adaptation. Experimental results show improved stability and accuracy for clustering structures obtained via bootstrapping, subsampling, and adaptive techniques. A meaningful consensus partition for an entire set of data points emerges from multiple clusterings of bootstraps and subsamples. Subsamples of small size can reduce computational cost and measurement complexity for many unsupervised data mining tasks with distributed sources of data. This empirical study also compares the performance of adaptive and non-adaptive clustering ensembles using different consensus functions on a number of datasets. By focusing attention on the data points with the least consistent clustering assignments, one can better approximate the inter-cluster boundaries, or at least create diversity in those boundaries, which improves clustering accuracy and convergence speed as a function of the number of partitions in the ensemble. The comparison of adaptive and non-adaptive approaches is a new avenue for research, and this study helps to pave the way for the useful application of distributed data mining methods.
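
The adaptive step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the consistency measure or the update rule, so the function names, the "most frequent label" consistency score, the discount parameter `alpha`, and the assumption that cluster labels are already aligned across partitions are all hypothetical choices made for illustration. Sampling here is done with replacement for brevity.

```python
import random
from collections import Counter


def consistency(labels_per_partition, n_points):
    """Score each point by how often it received its most frequent label.

    labels_per_partition: list of dicts {point_index: cluster_label}, one per
    earlier partition, each covering only the points sampled for it.
    Assumption: labels are already aligned across partitions (the relabeling
    step is not shown here).
    """
    scores = []
    for i in range(n_points):
        seen = [part[i] for part in labels_per_partition if i in part]
        if not seen:
            scores.append(0.0)  # never sampled so far: treat as least consistent
        else:
            top_count = Counter(seen).most_common(1)[0][1]
            scores.append(top_count / len(seen))
    return scores


def update_probabilities(probs, scores, alpha=0.5):
    """Shift sampling weight toward inconsistent points, then renormalize.

    alpha (assumed to be > 0) discounts the previous probability so that the
    new weights stay strictly positive.
    """
    raw = [alpha * p + (1.0 - s) for p, s in zip(probs, scores)]
    total = sum(raw)
    return [r / total for r in raw]


def draw_sample(probs, size, rng):
    """Draw point indices with replacement, weighted by the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=size)


# Two earlier partitions over four points; point 3 was never sampled.
history = [{0: "a", 1: "a", 2: "b"}, {0: "a", 1: "b", 2: "b"}]
scores = consistency(history, n_points=4)
probs = update_probabilities([0.25] * 4, scores)
sample = draw_sample(probs, size=3, rng=random.Random(0))
```

In this toy run, point 1 (assigned to two different clusters) and point 3 (never sampled) end up with larger sampling probabilities than the consistently assigned points 0 and 2, so the next subsample is more likely to include them.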
