Mining Representative Subspace Clusters in High-dimensional Data

A major challenge in subspace clustering is that it may generate an explosive number of clusters at high computational cost, which severely restricts its practical use. The problem worsens as the dimensionality of the data increases. In this paper, we propose mining representative subspace clusters in high-dimensional data to alleviate the problem. Typically, subspace clusters can themselves be clustered into groups, and several representative clusters can be generated from each group. Unfortunately, when the size of the set of representative clusters is specified, finding the optimal set is NP-hard. To solve this problem efficiently, we present an approximate method, PCoC. The greatest advantage of our method is that it requires only a subset of the subspace clusters as input. Our performance study shows the effectiveness and efficiency of the method.
