Enhancing semi-supervised clustering: a feature projection perspective

Semi-supervised clustering employs limited supervision, in the form of labeled instances or pairwise instance constraints, to aid unsupervised clustering, and often significantly improves clustering performance. Despite the considerable research effort devoted to this problem, most existing work is not designed to handle high-dimensional sparse data. This paper fills this void by developing a Semi-supervised Clustering method based on spheRical K-mEans via fEature projectioN (SCREEN). Specifically, we formulate the problem of constraint-guided feature projection, which integrates naturally with semi-supervised clustering algorithms and effectively reduces the data dimension. Our experimental results on several real-world data sets show that the SCREEN method handles high-dimensional data effectively and achieves strong clustering performance.
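Since SCREEN builds on spherical k-means, the following is a minimal sketch of that core step: points are L2-normalized so that cluster assignment is by cosine similarity, and centroids are re-normalized means. This is only the unsupervised base algorithm, not the paper's SCREEN method; the constraint-guided feature projection step is omitted, and the farthest-first initialization shown here is an illustrative choice, not taken from the paper.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50):
    """Cluster rows of X (assumed nonzero) on the unit sphere by cosine similarity."""
    # L2-normalize rows so dot products equal cosine similarities
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Farthest-first initialization: greedily pick mutually dissimilar points
    centroids = [X[0]]
    for _ in range(k - 1):
        sims = np.max(X @ np.array(centroids).T, axis=1)
        centroids.append(X[np.argmin(sims)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assign each point to the centroid with the highest cosine similarity
        labels = np.argmax(X @ centroids.T, axis=1)
        # Recompute each centroid as the normalized mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                mean = members.sum(axis=0)
                centroids[j] = mean / np.linalg.norm(mean)
    return labels, centroids
```

For high-dimensional sparse text data, X would typically be a TF-IDF matrix; the normalization step is what makes the objective the cosine-based one used by spherical k-means rather than ordinary Euclidean k-means.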
