Central Clustering of Categorical Data with Automated Feature Weighting

The ability to cluster high-dimensional categorical data is essential for many machine learning applications such as bioinformatics. Central clustering of categorical data remains difficult because there is no geometrically interpretable definition of a cluster center. In this paper, we propose a novel kernel-density-based definition using a Bayes-type probability estimator. We then propose a new algorithm, called k-centers, for central clustering of categorical data; it incorporates a new feature weighting scheme that automatically assigns each attribute a weight measuring its individual contribution to the clusters. Experimental results on real-world data show the outstanding performance of the proposed algorithm, especially in recognizing biological patterns in DNA sequences.
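The abstract does not spell out the update rules, so the following is only a minimal sketch of the general idea in Python/NumPy: cluster centers represented as smoothed category-frequency estimates per attribute (a stand-in for the paper's Bayes-type kernel density estimator) and per-attribute weights derived from an entropy heuristic. The function name k_centers_sketch, the Laplace smoothing parameter alpha, and the exp(-entropy) weighting are illustrative assumptions, not the paper's actual formulas.

```python
# Sketch of a k-centers-style loop for categorical data with attribute weighting.
# Assumptions (not from the paper): Laplace-smoothed frequency tables as centers,
# entropy-based attribute weights, and log-probability scoring for assignment.
import numpy as np

def k_centers_sketch(X, k, n_iter=20, alpha=1.0, seed=0):
    """X: (n, d) array of integer category codes; k: number of clusters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_cats = X.max(axis=0) + 1              # number of categories per attribute
    labels = rng.integers(0, k, size=n)     # random initial assignment

    for _ in range(n_iter):
        # Estimate centers: smoothed category frequencies per cluster and attribute.
        centers = []                         # centers[j] has shape (k, n_cats[j])
        for j in range(d):
            counts = np.zeros((k, n_cats[j]))
            np.add.at(counts, (labels, X[:, j]), 1.0)
            probs = (counts + alpha) / (counts.sum(axis=1, keepdims=True)
                                        + alpha * n_cats[j])
            centers.append(probs)

        # Attribute weights: attributes whose within-cluster distributions are
        # more concentrated (lower entropy) contribute more to the assignment.
        ent = np.array([
            -np.mean(np.sum(centers[j] * np.log(centers[j]), axis=1))
            for j in range(d)
        ])
        w = np.exp(-ent)
        w /= w.sum()

        # Reassign each object to the cluster giving the highest weighted score.
        score = np.zeros((n, k))
        for j in range(d):
            score += w[j] * np.log(centers[j][:, X[:, j]].T)
        new_labels = score.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, w

# Example usage on synthetic categorical data:
# labels, weights = k_centers_sketch(
#     np.random.default_rng(1).integers(0, 4, size=(200, 10)), k=3)
```

The design choice of representing each center as a per-attribute probability table, rather than a single mode, is what makes the center geometrically interpretable for categorical data; the weighting step is one simple way to let informative attributes dominate the assignment.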
