HISSCLU: a hierarchical density-based method for semi-supervised clustering

In situations where class labels are known for a subset of the objects, a cluster analysis that respects this information, i.e. semi-supervised clustering, can give insight into the class and cluster structure of a data set. Several semi-supervised clustering algorithms such as HMRF-K-Means [4], COP-K-Means [26] and the CCL-algorithm [18] have recently been proposed. Most of them extend well-known clustering methods (K-Means [22], Complete Link [17]) by enforcing two types of constraints: must-links between objects of the same class and cannot-links between objects of different classes. In this paper, we propose HISSCLU, a hierarchical, density-based method for semi-supervised clustering. Instead of deriving explicit constraints from the labeled objects, HISSCLU expands the clusters starting at all labeled objects simultaneously. During the expansion, class labels are assigned to the unlabeled objects in the way most consistent with the cluster structure. Using this information, the hierarchical cluster structure is determined. The result is visualized in a semi-supervised cluster diagram showing both the cluster structure and the class assignment. Compared to methods based on must-links and cannot-links, our method better preserves the actual cluster structure, particularly if the data set contains several distinct clusters of the same class (i.e. the intra-class data distribution is multimodal). HISSCLU yields a deterministic result, is efficient, and is robust against noise. The performance of our algorithm is demonstrated in an extensive experimental evaluation on synthetic and real-world data sets.
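
To make the idea of a simultaneous, density-based expansion from labeled seeds concrete, the following is a minimal sketch, not the published HISSCLU algorithm: it performs an OPTICS-like multi-seed expansion in which every labeled object starts a front and each unlabeled object inherits the label of the front that reaches it with the smallest reachability distance. The function name expand_labels and the parameters eps and min_pts are assumptions chosen for this illustration.

```python
import heapq

import numpy as np


def expand_labels(X, labels, eps=1.0, min_pts=5):
    """Illustrative sketch of a simultaneous, density-based label expansion
    from all labeled seed objects (labels[i] == -1 means 'unlabeled').
    This is only an OPTICS-like multi-seed expansion, not HISSCLU itself."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    assigned = labels.copy()

    # Pairwise distances and core distances (distance to the min_pts-th neighbour).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    core = np.sort(dists, axis=1)[:, min(min_pts, n - 1)]

    # Seed the priority queue with every labeled object at reachability 0.
    reach = np.full(n, np.inf)
    heap = [(0.0, i) for i in range(n) if labels[i] != -1]
    heapq.heapify(heap)
    for _, i in heap:
        reach[i] = 0.0

    order = []  # expansion order; could be plotted as a cluster/label diagram
    done = np.zeros(n, dtype=bool)
    while heap:
        r, i = heapq.heappop(heap)
        if done[i]:
            continue
        done[i] = True
        order.append((i, r, assigned[i]))
        # Relax neighbours within eps: an unlabeled point tentatively takes the
        # label of the front that currently reaches it with the smallest
        # reachability distance, i.e. most consistently with the density structure.
        for j in np.where(dists[i] <= eps)[0]:
            if done[j]:
                continue
            new_r = max(core[i], dists[i, j])
            if new_r < reach[j]:
                reach[j] = new_r
                if labels[j] == -1:
                    assigned[j] = assigned[i]
                heapq.heappush(heap, (new_r, j))
    # Points never reached within eps keep the label -1 (noise / unreachable).
    return assigned, order


# Small usage example: two labeled seeds, one per cluster.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.4, 0.0],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels = np.array([0, -1, -1, 1, -1, -1])
assigned, order = expand_labels(X, labels, eps=1.0, min_pts=2)
print(assigned)  # expected: [0 0 0 1 1 1]
```

In this sketch the expansion order and the recorded reachability values play the role of the reachability plot in OPTICS [10]; the actual HISSCLU diagram additionally visualizes the class assignment along that order.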

[1]  Christian Böhm,et al.  Enhancing instance-based classification with local density: a new algorithm for classifying unbalanced biomedical data , 2006, Bioinform..

[2]  Hong Liu,et al.  Evolutionary semi-supervised fuzzy clustering , 2003, Pattern Recognit. Lett..

[3]  Raymond J. Mooney,et al.  A probabilistic framework for semi-supervised clustering , 2004, KDD.

[4]  Claudio Gentile,et al.  Incremental Algorithms for Hierarchical Classification , 2004, J. Mach. Learn. Res..

[5]  J. Heitman,et al.  Nuclear protein localization, 1991, Biochimica et Biophysica Acta.

[6]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[7]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[8]  Byron Dom,et al.  An Information-Theoretic External Cluster-Validity Measure , 2002, UAI.

[9]  Zoubin Ghahramani,et al.  Learning from labeled and unlabeled data with label propagation , 2002 .

[10]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[11]  Dan Klein,et al.  From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering , 2002, ICML.

[12]  J. MacQueen.  Some methods for classification and analysis of multivariate observations, 1967.

[13]  Zhengdong Lu,et al.  Semi-supervised Learning with Penalized Probabilistic Clustering , 2004, NIPS.

[14]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[15]  Stefan Kramer,et al.  Ensembles of Balanced Nested Dichotomies for Multi-class Problems , 2005, PKDD.

[16]  Christian Böhm,et al.  Supervised machine learning techniques for the classification of metabolic disorders in newborns , 2004, Bioinform..

[17]  Johannes Fürnkranz,et al.  Round Robin Classification , 2002, J. Mach. Learn. Res..

[18]  Claudio Gentile,et al.  Hierarchical classification: combining Bayes with SVM , 2006, ICML.

[19]  Raymond J. Mooney,et al.  Integrating constraints and metric learning in semi-supervised clustering , 2004, ICML.

[20]  Zhiyong Lu,et al.  Automatic Extraction of Clusters from Hierarchical Clustering Representations , 2003, PAKDD.

[21]  Hongyu Li,et al.  Outlier Detection in Benchmark Classification Tasks, 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[22]  Claire Cardie,et al.  Constrained K-means Clustering with Background Knowledge, 2001, Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pp. 577–584.

[23]  Ming-Syan Chen,et al.  On the Techniques for Data Clustering with Numerical Constraints , 2003, SDM.

[24]  M. Kanehisa,et al.  A knowledge base for predicting protein localization sites in eukaryotic cells , 1992, Genomics.

[25]  Christian Baumgartner,et al.  Enhancing instance-based classification with local density, 2006.

[26]  Yoram Singer,et al.  Large margin hierarchical classification , 2004, ICML.

[27]  Christoph F. Eick,et al.  Supervised clustering - algorithms and benefits , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.