Clustering Short-Text Using Non-negative Matrix Factorization of Hadamard Product of Similarities

Short-text mining has become an important area of research in information retrieval (IR) and data mining. Ncut term weighting was recently proposed for clustering short texts using non-negative matrix factorization. Non-negative matrix factorization can be employed with such term weighting when the similarity measure is the inner product of the term-document matrix. We propose a new weighting scheme and devise a new clustering algorithm that uses the Hadamard product of similarity matrices. We demonstrate that our technique yields much better clustering than the Ncut weighting scheme. We evaluate clustering quality with three measures: purity, normalized mutual information, and adjusted Rand index. We use standard benchmark datasets and also compare the performance of our algorithm with the well-known Ng-Jordan-Weiss document clustering technique. Experimental results suggest that weighting by the Hadamard product gives better clustering of short-text documents.
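To make the general idea concrete, the following is a minimal sketch in plain Python, not the algorithm of this paper. The abstract does not state which similarity matrices are combined or how the factorization is computed, so everything specific here is an assumption: cosine and Jaccard similarity stand in for the two similarity matrices, a standard multiplicative update for symmetric non-negative factorization stands in for the solver, and all function names and the toy documents are hypothetical. Purity is included as the simplest of the three evaluation measures named above.

```python
import math
import random

def term_counts(doc):
    """Bag-of-words term counts for one short text."""
    counts = {}
    for tok in doc.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def cosine_sim(a, b):
    # Assumed first similarity: cosine of the term-count vectors.
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_sim(a, b):
    # Assumed second similarity: Jaccard overlap of the term sets.
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

def sim_matrix(vecs, sim):
    return [[sim(u, v) for v in vecs] for u in vecs]

def hadamard(A, B):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def symmetric_nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Approximate A by H H^T with H >= 0, using the damped multiplicative
    update H <- H * (0.5 + 0.5 * (A H) / (H H^T H)). This solver is an
    assumption of the sketch, not taken from the paper."""
    rng = random.Random(seed)
    n = len(A)
    H = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    for _ in range(iters):
        AH = matmul(A, H)
        HtH = matmul([list(c) for c in zip(*H)], H)
        HHtH = matmul(H, HtH)
        H = [[H[i][j] * (0.5 + 0.5 * AH[i][j] / (HHtH[i][j] + eps))
              for j in range(k)] for i in range(n)]
    return H

def cluster_labels(H):
    # Assign each document to the factor with the largest coefficient.
    return [max(range(len(row)), key=row.__getitem__) for row in H]

def purity(pred, truth):
    """Fraction of documents that fall in the majority class of their cluster."""
    clusters = {}
    for p, t in zip(pred, truth):
        by_class = clusters.setdefault(p, {})
        by_class[t] = by_class.get(t, 0) + 1
    return sum(max(c.values()) for c in clusters.values()) / len(pred)

# Hypothetical toy corpus with two obvious topics.
docs = ["cheap flights paris", "cheap flights rome",
        "python numpy arrays", "python pandas arrays"]
vecs = [term_counts(d) for d in docs]
S = hadamard(sim_matrix(vecs, cosine_sim), sim_matrix(vecs, jaccard_sim))
H = symmetric_nmf(S, k=2)
labels = cluster_labels(H)
print(labels, purity(labels, ["travel", "travel", "code", "code"]))
```

Multiplying the two similarity matrices element-wise keeps a pair of documents strongly connected only when both measures agree, which is one plausible reading of why a Hadamard product could sharpen cluster structure in sparse short-text data. On real benchmark data one would also compute normalized mutual information and adjusted Rand index, which are omitted here for brevity.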
