Semi-supervised Microblog Clustering Method via Dual Constraints

In this paper, we present a semi-supervised clustering method for microblogs in which both word-level and document-level constraints are generated automatically from statistical information alone, without any external knowledge. The key idea is first to explore term correlation data, capturing both inter- and intra-correlation of words, from which an initial similarity between words is deduced. An iterative method is then established to calculate both word similarity and microblog similarity, and the dual constraints are derived from these two similarities. We then formulate the short text clustering problem as non-negative matrix factorization under the dual constraints. An empirical study on two real-world datasets shows the superior performance of our framework in handling noisy, short microblogs.
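To make the idea of factorizing under dual constraints concrete, the following is a minimal sketch and not the formulation used in this paper. It assumes the word-level and document-level constraints are already available as symmetric non-negative similarity matrices (`S_word`, `S_doc`) and swaps in a standard dual graph-regularized NMF objective, ||X - WH||_F^2 + mu tr(W^T L_word W) + lam tr(H L_doc H^T), optimized with multiplicative updates. The function name, the weights `mu` and `lam`, and the toy data are all illustrative assumptions.

```python
import numpy as np

def dual_constrained_nmf(X, S_word, S_doc, k, mu=0.1, lam=0.1, n_iter=200, seed=0):
    """Sketch of NMF with word-side and document-side graph regularizers.

    X      : (n_words, n_docs) non-negative term-document matrix
    S_word : (n_words, n_words) symmetric word similarity (assumed given)
    S_doc  : (n_docs, n_docs) symmetric document similarity (assumed given)
    Returns W (n_words, k), H (k, n_docs) and the objective history.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    D_word = np.diag(S_word.sum(axis=1))  # degree matrices of the two graphs
    D_doc = np.diag(S_doc.sum(axis=1))
    eps = 1e-12  # guards against division by zero

    def objective(W, H):
        fit = np.linalg.norm(X - W @ H, "fro") ** 2
        reg_w = np.trace(W.T @ (D_word - S_word) @ W)
        reg_d = np.trace(H @ (D_doc - S_doc) @ H.T)
        return fit + mu * reg_w + lam * reg_d

    history = [objective(W, H)]  # history[0] is the value at initialization
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        W *= (X @ H.T + mu * S_word @ W) / (W @ H @ H.T + mu * D_word @ W + eps)
        H *= (W.T @ X + lam * H @ S_doc) / (W.T @ W @ H + lam * H @ D_doc + eps)
        history.append(objective(W, H))
    return W, H, history

# Toy data (hypothetical): words 0-2 occur in docs 0-2, words 3-5 in docs 3-5.
block = np.ones((3, 3))
zeros = np.zeros((3, 3))
X = np.block([[block, zeros], [zeros, block]])
# Similarity 1 between distinct items of the same group, 0 otherwise.
S = np.block([[block, zeros], [zeros, block]]) - np.eye(6)

W, H, history = dual_constrained_nmf(X, S_word=S, S_doc=S, k=2)
doc_labels = H.argmax(axis=0)  # cluster of each document
print(doc_labels, history[0], history[-1])
```

Each document is assigned to the cluster with the largest entry in its column of `H`. Larger `mu` and `lam` pull similar words and similar documents toward the same latent factor, at the cost of a looser reconstruction of `X`.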
