Clustering categorical data in projected spaces

Clustering categorical data has been widely investigated, and many approaches have been proposed. However, most existing methods suffer from one or more of the following limitations: (1) difficulty detecting clusters of very low dimensionality embedded in high-dimensional spaces, (2) lack of an automatic mechanism for identifying the relevant dimensions of each cluster, (3) lack of an outlier detection mechanism, and (4) dependence on a set of parameters that must be properly tuned. Most existing approaches cannot deal with these four issues in a unified framework. This motivates us to propose a fully automatic projected clustering algorithm for high-dimensional categorical data that addresses all four issues in a single framework. Our algorithm comprises two phases: (1) outlier handling and (2) clustering in projected spaces. The first phase is based on a probabilistic approach that uses the beta mixture model to identify and eliminate outlier objects from a data set in a systematic way. In the second phase, clustering is driven by a novel quality function that identifies projected clusters of low dimensionality embedded in a high-dimensional space without requiring the user to set any parameters. The suitability of our proposal is demonstrated through empirical studies on synthetic and real data sets.
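To make the first phase more concrete, the following is a minimal sketch of how a two-component beta mixture could separate outliers from regular objects. It is not the authors' implementation. Several things here are assumptions that the abstract does not specify: each object is assumed to already have a score in (0, 1) where low values suggest an outlier, the mixture is assumed to have exactly two components, and the M-step uses a weighted method-of-moments update as a simple stand-in for a full maximum-likelihood fit. The names `fit_beta_mixture` and `flag_outliers` are hypothetical.

```python
import math


def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x, for 0 < x < 1."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def moments_to_params(mean, var):
    """Method-of-moments estimate of (a, b) from a mean and a variance."""
    mean = min(max(mean, 1e-6), 1 - 1e-6)
    # A beta variance must stay below mean * (1 - mean); clamp to keep a, b > 0.
    var = min(max(var, 1e-6), mean * (1 - mean) * 0.99)
    k = mean * (1 - mean) / var - 1
    return mean * k, (1 - mean) * k


def fit_beta_mixture(scores, n_iter=50):
    """Fit a two-component beta mixture with an EM-style loop.

    Assumption: the M-step uses weighted moments, which only approximates
    the maximum-likelihood update.
    """
    xs = [min(max(s, 1e-6), 1 - 1e-6) for s in scores]
    overall_mean = sum(xs) / len(xs)
    # Initial hard assignment: below the overall mean -> component 0.
    resp = [[1.0, 0.0] if x < overall_mean else [0.0, 1.0] for x in xs]
    weights, params = [], []
    for _ in range(n_iter):
        # M-step: update mixing weights and beta parameters.
        weights, params = [], []
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            mean = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, xs)) / nk
            weights.append(nk / len(xs))
            params.append(moments_to_params(mean, var))
        # E-step: posterior responsibility of each component for each score.
        for i, x in enumerate(xs):
            logs = [math.log(weights[k]) + beta_logpdf(x, *params[k])
                    for k in range(2)]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp[i] = [p / total for p in probs]
    return weights, params, resp


def flag_outliers(scores, threshold=0.5):
    """Return indices assigned to the component with the lower mean."""
    _, params, resp = fit_beta_mixture(scores)
    means = [a / (a + b) for a, b in params]
    low = 0 if means[0] <= means[1] else 1
    return [i for i, r in enumerate(resp) if r[low] > threshold]


if __name__ == "__main__":
    # Hypothetical scores: the first four objects score much lower than the rest.
    scores = [0.05, 0.08, 0.10, 0.12,
              0.70, 0.75, 0.80, 0.82, 0.85, 0.88,
              0.90, 0.92, 0.78, 0.86, 0.83, 0.91]
    print(flag_outliers(scores))
```

In this sketch the outlier group is read off the fitted mixture rather than chosen by a hand-tuned cutoff on the scores, which is the spirit of the parameter-free goal stated above. The 0.5 responsibility threshold simply assigns each object to its more probable component.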
