A Cluster Ensemble Framework for Large Data sets

Combining multiple clustering solutions is important for obtaining a robust clustering solution, merging distributed clustering solutions, and scaling to large data sets. This paper discusses how to combine multiple clustering solutions within a scalable and robust framework for large data sets. A scalable framework requires both cluster ensemble creation and ensemble merging to be efficient in time and memory. We also introduce the concept of filtering malformed clusters from the ensemble; such clusters result from poor initialization, unbalanced data distribution, or noise. Experimental results on real data sets show that this approach scales and produces cluster partitions that are functionally equivalent to or better than those obtained by clustering all the data at once, and than the individual clustering solutions contained in the ensemble. We also compare our algorithm with other ensemble merging and scalable clustering algorithms to point out its strengths and limitations.
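The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea (create an ensemble on subsets, filter suspicious clusters, merge), not the authors' method. Everything in it is an assumption made for illustration: the use of plain k-means, the hypothetical names `kmeans` and `ensemble_cluster`, the `min_frac` size threshold used as a stand-in for "malformed cluster" filtering, and the merging step, which simply re-clusters the surviving centroids weighted by cluster size.

```python
import numpy as np

def kmeans(X, k, rng, iters=50, weights=None):
    """Plain (optionally weighted) Lloyd's k-means; returns centroids and labels."""
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    # Initialise with k distinct points drawn at random (requires len(X) >= k).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():  # leave an empty cluster's centroid where it is
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dist.argmin(axis=1)

def ensemble_cluster(X, k, n_chunks=4, min_frac=0.05, seed=0):
    """Cluster disjoint chunks, filter tiny clusters, merge the centroids."""
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(len(X)), n_chunks)
    cents, sizes = [], []
    for idx in chunks:
        # Ensemble creation: each chunk is clustered on its own, so only one
        # chunk has to be held in working memory at a time.
        c, lab = kmeans(X[idx], k, rng)
        counts = np.bincount(lab, minlength=k)
        # Assumed filter: drop clusters holding under min_frac of the chunk.
        keep = counts >= min_frac * len(idx)
        cents.append(c[keep])
        sizes.append(counts[keep])
    cents, sizes = np.vstack(cents), np.concatenate(sizes)
    # Assumed merge: size-weighted k-means over the surviving centroids
    # (needs at least k centroids to survive the filter).
    merged, _ = kmeans(cents, k, rng, weights=sizes)
    dist = ((X[:, None, :] - merged[None, :, :]) ** 2).sum(axis=2)
    return merged, dist.argmin(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, (200, 2)), rng.normal(10, 0.5, (200, 2))])
    centers, labels = ensemble_cluster(X, k=2)
    print(centers)
```

In this sketch the merge step only ever sees the centroids and their sizes, never the raw points, which is one simple way to keep the merging cost independent of the data set size.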
