Semi-supervised concept factorization for document clustering

Nonnegative Matrix Factorization (NMF) and Concept Factorization (CF) are two popular methods for finding a low-rank approximation of a nonnegative matrix. Unlike NMF, CF can be applied to matrices containing negative values and can also be performed in a kernel space. Many methods built on NMF and CF, such as Graph regularized Nonnegative Matrix Factorization (GNMF) and Locally Consistent Concept Factorization (LCCF), significantly improve clustering performance. However, these are unsupervised learning methods and cannot exploit supervisory information. To enhance clustering performance with such information, this paper proposes Semi-Supervised Concept Factorization (SSCF), which incorporates pairwise constraints into CF as reward and penalty terms. These terms guarantee that data points belonging to the same cluster in the original space remain in the same cluster in the transformed space. Experimental results on document clustering show that the proposed algorithm outperforms state-of-the-art algorithms (KM, NMF, CF, LCCF, GNMF, PCCF) in terms of accuracy and mutual information.
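To make the CF baseline mentioned above concrete, the following is a minimal sketch of plain, unsupervised concept factorization, X ≈ X W Vᵀ, using the standard multiplicative updates on a linear kernel K = XᵀX. It is not the SSCF algorithm: the abstract does not state the form of the pairwise reward and penalty terms, so they are left out. The function names, the toy data, and the assumption that K is entrywise nonnegative are illustrative choices, not taken from the paper.

```python
import numpy as np


def concept_factorization(X, k, n_iter=200, eps=1e-10, seed=0):
    """Plain (unsupervised) CF sketch: X ~ X @ W @ V.T with W, V >= 0.

    X is (n_features, n_samples); each column is one document.
    Assumes the linear kernel K = X.T @ X is entrywise nonnegative.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    K = X.T @ X                      # linear kernel between documents
    W = rng.random((n, k))           # documents -> concepts (association weights)
    V = rng.random((n, k))           # documents -> cluster memberships
    for _ in range(n_iter):
        # Standard multiplicative updates; eps guards against division by zero.
        W *= (K @ V) / (K @ W @ (V.T @ V) + eps)
        V *= (K @ W) / (V @ (W.T @ K @ W) + eps)
    return W, V


def cf_objective(X, W, V):
    """Squared Frobenius reconstruction error ||X - X W V^T||^2."""
    return np.linalg.norm(X - X @ W @ V.T, "fro") ** 2


# Toy usage: 12 random nonnegative "documents" with 5 features, 3 clusters.
X_demo = np.random.default_rng(1).random((5, 12))
W_demo, V_demo = concept_factorization(X_demo, k=3)
labels_demo = V_demo.argmax(axis=1)  # assign each document to its strongest cluster
```

Under this sketch, a semi-supervised variant would add terms to the objective that reward agreement between rows of V for must-link pairs and penalize it for cannot-link pairs; the exact terms and the resulting update rules would have to come from the paper itself.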
