Corpus-based topic diffusion for short text clustering

In this paper, we propose a novel corpus-based enrichment approach for short text clustering. Because sparseness leads to insufficient word co-occurrence and a lack of context information, previous studies use external sources such as Wikipedia or WordNet to enrich the representation of short text documents, which requires extra resources and may introduce inconsistencies. Corpus-based approaches, on the other hand, use no external information in mining short text data. By introducing a set of conjugate definitions to characterize the structures of topics and words, and by proposing a virtual generative procedure for short texts, we perform expansion on short text data. Specifically, new words that may not appear in a short text document are added with a virtual term frequency, which is obtained from the posterior probabilities of the new words given all the words in that document. The complete procedure can be regarded as mapping data points (documents) from the original feature space to a hidden semantic space (topic space). After semantic smoothing is performed there, the data points are mapped back to the original space. We conduct experiments on two short text datasets, and the results show that the proposed method can effectively address the sparseness problem. On these datasets, our method, using only a basic clustering algorithm, attains performance comparable to that of methods based on enrichment with external information sources.
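The expansion step described above can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a topic-word matrix `phi` and a topic prior `prior` are already available from some corpus-level topic model, and it uses a simple naive-Bayes-style posterior over topics. The function name `expand_documents` and the `weight` parameter are hypothetical. The sketch maps each document to topic space, maps it back to word space as p(w | d), and adds the result as a virtual term frequency.

```python
import numpy as np

def expand_documents(X, phi, prior, weight=1.0):
    """Add virtual term frequencies to a sparse document-term matrix.

    X      : (D, V) array of raw term counts
    phi    : (K, V) array, row k is p(w | z=k)   (assumed given)
    prior  : (K,)   array, p(z)                  (assumed given)
    weight : hypothetical knob scaling the virtual counts
    """
    eps = 1e-12  # avoids log(0) for words a topic never emits
    # Map to topic space: log p(z | d) = log p(z) + sum_w n(d, w) log p(w | z) + const
    log_post = np.log(prior + eps)[None, :] + X @ np.log(phi + eps).T
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)          # (D, K), rows sum to 1
    # Map back to word space: p(w | d) = sum_z p(w | z) p(z | d)
    p_word = post @ phi                              # (D, V)
    # Virtual frequency proportional to document length
    lengths = X.sum(axis=1, keepdims=True)
    return X + weight * lengths * p_word

# Toy data: two topics over a four-word vocabulary, two one-word documents.
phi = np.array([[0.5, 0.5, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5]])
prior = np.array([0.5, 0.5])
X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
expanded = expand_documents(X, phi, prior)
```

In this toy case the first document contains only word 0, yet after expansion it also carries weight on word 1, which shares its topic, while words 2 and 3 stay near zero. The expanded matrix can then be passed to any basic clustering algorithm.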
