Paper Information - Semi-supervised Parameter-Free Divisive Hierarchical Clustering of Categorical Data

Semi-supervised Parameter-Free Divisive Hierarchical Clustering of Categorical Data

Semi-supervised clustering can yield considerable improvement over unsupervised clustering. Most existing semi-supervised clustering algorithms are non-hierarchical, derived from the k-means algorithm, and designed for analyzing numeric data. Clustering categorical data is challenging because such data lack an inherently meaningful similarity measure, and semi-supervised clustering in the categorical domain remains unexplored. In this paper, we propose a novel semi-supervised divisive hierarchical algorithm for categorical data. Our algorithm is parameter-free and fully automatic, and it effectively exploits background knowledge in the form of instance-level constraints to improve the quality of the resulting dendrogram. Experiments on real-life data demonstrate the promising performance of our algorithm.

Tengke Xiong | Shengrui Wang | André Mayers | Ernest Monga

[1] Ohn Mar San,et al. An alternative extension of the k-means algorithm for clustering categorical data , 2004 .

[2] Eugenio Cesario,et al. Top-Down Parameter-Free Clustering of High-Dimensional Categorical Data , 2007, IEEE Transactions on Knowledge and Data Engineering.

[3] Pierre Hansen,et al. NP-hardness of Euclidean sum-of-squares clustering , 2008, Machine Learning.

[4] Anil K. Jain,et al. Data clustering: a review , 1999, CSUR.

[5] M. Greenacre,et al. Multiple Correspondence Analysis and Related Methods , 2006 .

[6] Joshua Zhexue Huang,et al. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values , 1998, Data Mining and Knowledge Discovery.

[7] S. S. Ravi,et al. Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results , 2009, Data Mining and Knowledge Discovery.

[8] Claire Cardie,et al. Constrained K-means Clustering with Background Knowledge , 2001, Proceedings of the Eighteenth International Conference on Machine Learning, p. 577–584.

[9] Dan Klein,et al. From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering , 2002, ICML.

[10] Inderjit S. Dhillon,et al. Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[11] Hui Xiong,et al. Enhancing semi-supervised clustering: a feature projection perspective , 2007, KDD '07.

[12] Inderjit S. Dhillon,et al. Semi-supervised graph clustering: a kernel approach , 2005, Machine Learning.

[13] Vipin Kumar,et al. Introduction to Data Mining , 2022, Data Mining and Machine Learning Applications.

[14] Sudipto Guha,et al. ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[15] Eamonn J. Keogh,et al. Towards parameter-free data mining , 2004, KDD.

[16] Vipin Kumar,et al. Introduction to Data Mining, (First Edition) , 2005 .

[17] Tengke Xiong,et al. A New MCA-Based Divisive Hierarchical Algorithm for Clustering Categorical Data , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[18] Jianhong Wu,et al. Subspace clustering for high dimensional categorical data , 2004, SKDD.

[19] Jörg Sander,et al. Semi-supervised Density-Based Clustering , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[20] Raymond J. Mooney,et al. A probabilistic framework for semi-supervised clustering , 2004, KDD.

[21] S. S. Ravi,et al. Clustering with Constraints: Feasibility Issues and the k-Means Algorithm , 2005, SDM.