Topic Detection from Short Text: A Term-based Consensus Clustering Method

Topic Detection from Short Text Systems (SMS) is the process of extracting distinct topics hidden inside short text collections, such as those from Twitter, Weibo, and instant messaging. With the recent emergence of large-volume collections of user-generated content enabled by online social media, topic detection from SMS has become a challenging yet promising means of online public opinion analysis. Many forms and methods of topic detection have been proposed in the literature, but reliably obtaining meaningful and coherent results remains difficult because of the extreme sparsity of SMS. To this end, we developed a Term-based Consensus Clustering (TCC) topic detection framework, which provides an unsupervised methodology for finding distinct topics within SMS collections. Specifically, we adopt a consensus clustering technique called K-means-based Consensus Clustering to handle SMS clustering, owing to its low computational complexity and robust clustering performance. To enrich the features available from the sparse SMS data, we conduct term clustering in the highly dense term space instead of the conventionally targeted sparse document space. More specifically, we first use a feature space transfer technique to represent a short text collection as a pseudo-document matrix, in which rows (instances) correspond to terms and columns (features) correspond to adjacent terms. Basic partitions are generated from the pseudo-document matrix for term clustering, followed by consensus clustering to obtain the final term clustering result. Finally, a document classification step assigns each document to the cluster in which most of its terms occur. Extensive experiments on real-world data sets demonstrate that TCC is comparable to several widely used methods in terms of topic detection quality. In particular, we demonstrate that TCC obtains the best clustering performance when the number of predefined topics in the short text collections is large.
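
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "adjacent terms" means terms co-occurring in the same short text, it uses a plain Lloyd-style k-means for the basic partitions, and it approximates the consensus step by running k-means on the concatenated one-hot encodings of those partitions. All function names and the toy documents are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def build_term_matrix(docs):
    """Pseudo-document matrix: rows are terms, columns count co-occurring terms.

    Assumption: two terms are "adjacent" if they appear in the same short text.
    """
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in vocab]
    for d in docs:
        for a, b in combinations(sorted(set(d)), 2):
            matrix[index[a]][index[b]] += 1.0
            matrix[index[b]][index[a]] += 1.0
    return vocab, matrix


def kmeans(rows, k, seed, iters=20):
    """Plain Lloyd-style k-means with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        for i, r in enumerate(rows):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(r, centers[c])),
            )
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if labels[i] == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def consensus(partitions, k, seed=0):
    """Simplified consensus: k-means on concatenated one-hot basic partitions."""
    n = len(partitions[0])
    onehot = [
        [1.0 if p[i] == c else 0.0 for p in partitions for c in range(k)]
        for i in range(n)
    ]
    return kmeans(onehot, k, seed)


def assign_documents(docs, vocab, term_labels):
    """Assign each document to the term cluster holding most of its terms."""
    label_of = dict(zip(vocab, term_labels))
    return [Counter(label_of[t] for t in d).most_common(1)[0][0] for d in docs]


# Toy collection of tokenized short texts (hypothetical data).
docs = [
    ["rain", "storm", "wind"],
    ["storm", "flood", "rain"],
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
]
k = 2
vocab, matrix = build_term_matrix(docs)
partitions = [kmeans(matrix, k, seed) for seed in range(5)]  # basic partitions
term_labels = consensus(partitions, k)
doc_labels = assign_documents(docs, vocab, term_labels)
print(dict(zip(vocab, term_labels)))
print(doc_labels)
```

Clustering terms rather than documents is the point of the design: each row of the term matrix aggregates evidence from every short text containing that term, so the rows are denser than the individual document vectors. The distance and utility function used in the sketch are illustrative choices and may differ from those in the paper.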
