Document Clustering With Dual Supervision Through Feature Reweighting

Traditional semi-supervised clustering uses only limited user supervision, in the form of instance seeds for clusters and pairwise instance constraints, to aid unsupervised clustering. For document clustering, however, user supervision can also be provided in other forms, such as labeling a feature by indicating whether it discriminates among clusters. This article addresses that gap by enhancing traditional semi-supervised clustering with feature supervision: the user is asked to label discriminating features while defining (labeling) the instance seeds or pairwise instance constraints. We explore several types of semi-supervised clustering algorithms with feature supervision. Our experimental results on several real-world data sets demonstrate that augmenting instance-level supervision with feature-level supervision can significantly improve document clustering performance.
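To make the idea of dual supervision concrete, the following minimal sketch combines instance seeds with feature reweighting. It is an illustration under stated assumptions, not the article's actual algorithm: the multiplicative `boost` factor, the function names, and the choice of seed-initialized spherical k-means as the base clusterer are all hypothetical, and the toy documents are invented for the example.

```python
import math

def reweight(doc, labeled_features, boost=2.0):
    """Scale user-labeled features by `boost` (assumed factor, not from the article)."""
    return {t: w * (boost if t in labeled_features else 1.0) for t, w in doc.items()}

def normalize(vec):
    """Scale a sparse vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def cosine(a, b):
    """Dot product of two sparse vectors; equals cosine when both are unit length."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(vecs):
    """Unit-length mean direction of a list of sparse vectors."""
    total = {}
    for v in vecs:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return normalize(total)

def seeded_kmeans(docs, seeds, labeled_features, boost=2.0, iters=10):
    """Cluster `docs` (term -> weight dicts).

    seeds: cluster id -> list of document indices (instance supervision).
    labeled_features: terms the user marked as discriminating (feature supervision).
    Seeds only initialize the centroids; they may be reassigned later.
    """
    vecs = [normalize(reweight(d, labeled_features, boost)) for d in docs]
    cents = {c: centroid([vecs[i] for i in idx]) for c, idx in seeds.items()}
    assign = None
    for _ in range(iters):
        new_assign = [max(cents, key=lambda c: cosine(v, cents[c])) for v in vecs]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for c in cents:
            members = [v for v, a in zip(vecs, assign) if a == c]
            if members:
                cents[c] = centroid(members)
    return assign

# Toy usage: two seed documents and two unlabeled ones dominated by a shared term.
docs = [
    {"goal": 2, "team": 1, "game": 1},   # seed for "sports"
    {"vote": 2, "party": 1, "game": 1},  # seed for "politics"
    {"goal": 1, "game": 3},
    {"vote": 1, "game": 3},
]
seeds = {"sports": [0], "politics": [1]}
labels = seeded_kmeans(docs, seeds, labeled_features={"goal", "vote"})
```

In this sketch, boosting the labeled terms increases their share of each document vector relative to the shared, non-discriminating term "game", which is one simple way feature-level supervision can complement instance seeds.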
