Random Subspace Ensembles for Clustering Categorical Data

Cluster ensembles address challenges that stem from the ill-posed nature of clustering. By leveraging the consensus across multiple clustering results, they can find robust and stable solutions while averaging out spurious structures that arise from the biases to which each participating algorithm is tuned. In this chapter we focus on the design of ensembles for categorical data. Our techniques build upon diverse input clusterings discovered in random subspaces and reduce the problem of defining a consensus function to a graph partitioning problem. We experimentally demonstrate the efficacy of our approach in combination with the categorical clustering algorithm COOLCAT.
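The pipeline outlined above (cluster in random attribute subspaces, then derive a consensus from a graph over the objects) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the chapter's method: the base clusterer is a tiny k-modes routine standing in for COOLCAT, the graph is a co-association graph, and the consensus step takes connected components of a thresholded graph as a crude substitute for a proper graph partitioner. All function names and parameters (`k_modes`, `co_association`, `threshold_components`, `subspace_ensemble`, `n_runs`, `subspace_size`) are hypothetical.

```python
import random
from collections import Counter


def k_modes(rows, k, rng, iters=10):
    """Tiny k-modes clusterer; a stand-in base algorithm, not COOLCAT."""
    distinct = list(dict.fromkeys(rows))  # unique rows, order preserved
    modes = rng.sample(distinct, min(k, len(distinct)))
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each row to the mode with the fewest mismatching attributes.
        labels = [
            min(range(len(modes)),
                key=lambda c: sum(a != b for a, b in zip(row, modes[c])))
            for row in rows
        ]
        # Recompute each mode as the most frequent value per attribute.
        for c in range(len(modes)):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                modes[c] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return labels


def co_association(labelings, n):
    """Fraction of base clusterings that place objects i and j together."""
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


def threshold_components(co, threshold=0.5):
    """Connected components of the graph whose edges have weight > threshold."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def subspace_ensemble(rows, k, n_runs=10, subspace_size=2, seed=0):
    """Cluster in random attribute subspaces, then combine via the graph."""
    rng = random.Random(seed)
    n_attrs = len(rows[0])
    labelings = []
    for _ in range(n_runs):
        # Draw a random subset of attributes and project every object onto it.
        attrs = sorted(rng.sample(range(n_attrs), subspace_size))
        projected = [tuple(row[a] for a in attrs) for row in rows]
        labelings.append(k_modes(projected, k, rng))
    return threshold_components(co_association(labelings, len(rows)))


# Toy categorical data: two groups that differ on every attribute.
toy = [
    ("red", "round", "small", "soft"),
    ("red", "round", "small", "soft"),
    ("red", "round", "small", "soft"),
    ("blue", "square", "large", "hard"),
    ("blue", "square", "large", "hard"),
    ("blue", "square", "large", "hard"),
]
labels = subspace_ensemble(toy, k=2, n_runs=10, subspace_size=2, seed=0)
```

In this sketch the number of consensus clusters is not fixed to `k`; it emerges from the threshold. A balanced k-way graph partitioner would be the natural replacement for `threshold_components` when a fixed number of clusters is required.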
