Paper information - Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering

Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has employed one of two approaches: 1) search-based methods that utilize supervised data to guide the search for the best clustering, and 2) similarity-based methods that use supervised data to adapt the underlying similarity metric used by the clustering algorithm. This paper presents a unified approach based on the K-Means clustering algorithm that incorporates both of these techniques. Experimental results demonstrate that the combined approach generally produces better clusters than either of the individual approaches.
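
To make the two ideas in the abstract concrete, here is a minimal Python sketch, not the authors' algorithm. It assumes NumPy, that every cluster has at least one labeled seed point, and a deliberately simple stand-in for metric learning: per-feature weights set to the inverse within-class variance of the labeled points. The function names `diagonal_metric_from_labels` and `seeded_weighted_kmeans` are hypothetical and chosen only for illustration.

```python
import numpy as np

def diagonal_metric_from_labels(X, seed_idx, seed_labels, k, eps=1e-3):
    """Similarity-based part (illustrative assumption): weight each feature by
    the inverse of its within-class variance among the labeled points."""
    X = np.asarray(X, dtype=float)
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    within = np.zeros(X.shape[1])
    for c in range(k):
        pts = X[seed_idx[seed_labels == c]]
        within += ((pts - pts.mean(axis=0)) ** 2).sum(axis=0)
    within /= len(seed_idx)
    return 1.0 / (within + eps)  # eps avoids division by zero

def seeded_weighted_kmeans(X, seed_idx, seed_labels, k, weights=None, n_iter=20):
    """K-Means with centroids seeded from labeled points (search-based part)
    and a feature-weighted squared Euclidean distance (similarity-based part)."""
    X = np.asarray(X, dtype=float)
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    w = np.ones(X.shape[1]) if weights is None else np.asarray(weights, dtype=float)

    # Initialise each centroid as the mean of its labeled seeds.
    # Assumes every cluster 0..k-1 has at least one seed.
    centroids = np.stack([X[seed_idx[seed_labels == c]].mean(axis=0) for c in range(k)])

    for _ in range(n_iter):
        # Weighted squared distance from every point to every centroid: shape (n, k).
        diff = X[:, None, :] - centroids[None, :, :]
        dist = (w * diff ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Keep labeled points in their given clusters.
        labels[seed_idx] = seed_labels

        new_centroids = centroids.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members):
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

A call such as `seeded_weighted_kmeans(X, seed_idx, seed_labels, k, diagonal_metric_from_labels(X, seed_idx, seed_labels, k))` uses both components together; passing `weights=None` falls back to plain seeded K-Means, which makes it easy to compare the two settings on the same data.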

R. Mooney | M. Bilenko | Sugato Basu
