Central Clustering of Categorical Data with Automated Feature Weighting

The ability to cluster high-dimensional categorical data is essential for many machine learning applications such as bioinformatics. Central clustering of categorical data remains difficult because there is no geometrically interpretable definition of a cluster center. In this paper, we propose a novel kernel-density-based definition using a Bayes-type probability estimator. We then propose a new algorithm, called k-centers, for central clustering of categorical data; it incorporates a new feature weighting scheme that automatically assigns each attribute a weight measuring its individual contribution to the clusters. Experimental results on real-world data show the outstanding performance of the proposed algorithm, especially in recognizing biological patterns in DNA sequences.
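The abstract does not spell out the update rules, so the following is only a minimal sketch of the general idea in Python/NumPy: cluster centers represented as smoothed category-frequency estimates per attribute (a stand-in for the paper's Bayes-type kernel density estimator) and per-attribute weights derived from an entropy heuristic. The function name k_centers_sketch, the Laplace smoothing parameter alpha, and the exp(-entropy) weighting are illustrative assumptions, not the paper's actual formulas.

```python
# Sketch of a k-centers-style loop for categorical data with attribute weighting.
# Assumptions (not from the paper): Laplace-smoothed frequency tables as centers,
# entropy-based attribute weights, and log-probability scoring for assignment.
import numpy as np

def k_centers_sketch(X, k, n_iter=20, alpha=1.0, seed=0):
    """X: (n, d) array of integer category codes; k: number of clusters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_cats = X.max(axis=0) + 1              # number of categories per attribute
    labels = rng.integers(0, k, size=n)     # random initial assignment

    for _ in range(n_iter):
        # Estimate centers: smoothed category frequencies per cluster and attribute.
        centers = []                         # centers[j] has shape (k, n_cats[j])
        for j in range(d):
            counts = np.zeros((k, n_cats[j]))
            np.add.at(counts, (labels, X[:, j]), 1.0)
            probs = (counts + alpha) / (counts.sum(axis=1, keepdims=True)
                                        + alpha * n_cats[j])
            centers.append(probs)

        # Attribute weights: attributes whose within-cluster distributions are
        # more concentrated (lower entropy) contribute more to the assignment.
        ent = np.array([
            -np.mean(np.sum(centers[j] * np.log(centers[j]), axis=1))
            for j in range(d)
        ])
        w = np.exp(-ent)
        w /= w.sum()

        # Reassign each object to the cluster giving the highest weighted score.
        score = np.zeros((n, k))
        for j in range(d):
            score += w[j] * np.log(centers[j][:, X[:, j]].T)
        new_labels = score.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, w

# Example usage on synthetic categorical data:
# labels, weights = k_centers_sketch(
#     np.random.default_rng(1).integers(0, 4, size=(200, 10)), k=3)
```

The design choice of representing each center as a per-attribute probability table, rather than a single mode, is what makes the center geometrically interpretable for categorical data; the weighting step is one simple way to let informative attributes dominate the assignment.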
