Robust projected clustering

Projected clustering partitions a data set into several disjoint clusters, plus outliers, so that each cluster exists in a subspace. Subspace clustering enumerates clusters of objects in all subspaces of a data set, and it tends to produce many overlapping clusters. Such algorithms have been extensively studied for numerical data, but only a few have been proposed for categorical data. Typical drawbacks of existing projected and subspace clustering algorithms for numerical or categorical data are that they rely on parameters whose appropriate values are difficult to set appropriately or that they are unable to identify projected clusters with few relevant attributes. We present P3C, a robust algorithm for projected clustering that can effectively discover projected clusters in the data while minimizing the number of required parameters. P3C does not need the number of projected clusters as input, and can discover, under very general conditions, the true number of projected clusters. P3C is effective in detecting very low-dimensional projected clusters embedded in high dimensional spaces. P3C positions itself between projected and subspace clustering in that it can compute both disjoint or overlapping clusters. P3C is the first projected clustering algorithm for both numerical and categorical data.

[1]  P. Rousseeuw,et al.  Unmasking Multivariate Outliers and Leverage Points , 1990 .

[2]  Philip S. Yu,et al.  Fast algorithms for projected clustering , 1999, SIGMOD '99.

[3]  Renée J. Miller,et al.  LIMBO: Scalable Clustering of Categorical Data , 2004, EDBT.

[4]  Jianyong Wang,et al.  On efficiently summarizing categorical databases , 2005, Knowledge and Information Systems.

[5]  Huan Liu,et al.  Subspace clustering for high dimensional data: a review , 2004, SKDD.

[6]  Michael K. Ng,et al.  HARP: a practical projected clustering algorithm , 2004, IEEE Transactions on Knowledge and Data Engineering.

[7]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Man Lung Yiu,et al.  Frequent-pattern based iterative projected clustering , 2003, Third IEEE International Conference on Data Mining.

[9]  Raymond Chi-Wing Wong,et al.  Projective clustering by histograms , 2005, IEEE Transactions on Knowledge and Data Engineering.

[10]  Michael K. Ng,et al.  On discovery of extremely low-dimensional clusters using semi-supervised projected clustering , 2005, 21st International Conference on Data Engineering (ICDE'05).

[11]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[12]  Alok N. Choudhary,et al.  Adaptive Grids for Clustering Massive Data Sets , 2001, SDM.

[13]  Martin Ester,et al.  P3C: A Robust Projected Clustering Algorithm , 2006, Sixth International Conference on Data Mining (ICDM'06).

[14]  Johannes Gehrke,et al.  CACTUS—clustering categorical data using summaries , 1999, KDD '99.

[15]  G. W. Snedecor STATISTICAL METHODS , 1967 .

[16]  Rakesh Agarwal,et al.  Fast Algorithms for Mining Association Rules , 1994, VLDB 1994.

[17]  Hans-Peter Kriegel,et al.  A generic framework for efficient subspace clustering of high-dimensional data , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[18]  Jianhong Wu,et al.  Subspace clustering for high dimensional categorical data , 2004, SKDD.

[19]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[20]  Man Lung Yiu,et al.  Iterative projected clustering by subspace mining , 2005, IEEE Transactions on Knowledge and Data Engineering.

[21]  Jian Tang,et al.  Capabilities of outlier detection schemes in large datasets, framework and methodologies , 2006, Knowledge and Information Systems.

[22]  T. M. Murali,et al.  A Monte Carlo algorithm for fast projective clustering , 2002, SIGMOD '02.

[23]  Hans-Peter Kriegel,et al.  Density-Connected Subspace Clustering for High-Dimensional Data , 2004, SDM.

[24]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[25]  Daniel A. Keim,et al.  A General Approach to Clustering in Large Databases with Noise , 2003, Knowledge and Information Systems.

[26]  Philip S. Yu,et al.  Finding generalized projected clusters in high dimensional spaces , 2000, SIGMOD '00.

[27]  Ira Assent,et al.  CLICKS: an effective algorithm for mining subspace clusters in categorical datasets , 2005, KDD '05.