A topic model for co-occurring normal documents and short texts

User comments, a large class of online short texts, are becoming increasingly prevalent with the development of online communication. These short texts are characterized by their co-occurrence with normal documents, which are usually longer. For example, multiple user comments may follow one news article, or multiple reader reviews may follow one blog post. The co-occurrence structure inherent in such text corpora is important for efficient learning of topics, but is rarely captured by conventional topic models. To capture this structure, we propose a topic model for co-occurring documents, referred to as COTM. In COTM, we assume there are two sets of topics: formal topics and informal topics. Formal topics can appear in both normal documents and short texts, whereas informal topics can appear only in short texts. Each normal document has a probability distribution over the set of formal topics. Each short text is composed of two topics: one from the set of formal topics, whose selection is governed by the topic probabilities of the corresponding normal document, and one from the set of informal topics. We also develop an online algorithm for COTM to handle large-scale corpora. Extensive experiments on real-world datasets demonstrate that COTM and its online algorithm outperform state-of-the-art methods by discovering more prominent, coherent and comprehensive topics.
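The generative structure described above can be illustrated with a small simulation. This is a minimal sketch and not the authors' implementation: the abstract does not specify priors or how the two topics of a short text combine to emit words, so the symmetric Dirichlet priors, the fixed 50/50 word-level switch between the formal and informal topic, and all names (`generate_corpus`, `formal_words`, `informal_prior`, and so on) are assumptions made for illustration.

```python
import numpy as np

def generate_corpus(n_docs=3, comments_per_doc=2, n_formal=4, n_informal=2,
                    vocab_size=20, doc_len=30, comment_len=6, seed=0):
    """Sample a toy corpus of normal documents with attached short texts."""
    rng = np.random.default_rng(seed)
    # Topic-word distributions (assumed symmetric Dirichlet priors).
    formal_words = rng.dirichlet(np.full(vocab_size, 0.5), size=n_formal)
    informal_words = rng.dirichlet(np.full(vocab_size, 0.5), size=n_informal)
    # Assumed corpus-level distribution over informal topics.
    informal_prior = rng.dirichlet(np.full(n_informal, 1.0))

    corpus = []
    for _ in range(n_docs):
        # Each normal document has a distribution over formal topics only.
        theta = rng.dirichlet(np.full(n_formal, 0.5))
        z = rng.choice(n_formal, size=doc_len, p=theta)
        doc_words = [int(rng.choice(vocab_size, p=formal_words[k])) for k in z]

        comments = []
        for _ in range(comments_per_doc):
            # One formal topic, drawn from the parent document's theta.
            formal_k = int(rng.choice(n_formal, p=theta))
            # One informal topic, available to short texts only.
            informal_k = int(rng.choice(n_informal, p=informal_prior))
            # Assumed 50/50 switch deciding which topic emits each word.
            use_formal = rng.random(comment_len) < 0.5
            words = [
                int(rng.choice(vocab_size,
                               p=formal_words[formal_k] if f
                               else informal_words[informal_k]))
                for f in use_formal
            ]
            comments.append({"formal_topic": formal_k,
                             "informal_topic": informal_k,
                             "words": words})
        corpus.append({"theta": theta, "words": doc_words,
                       "comments": comments})
    return corpus
```

Calling `generate_corpus()` returns a list of documents, each carrying its formal-topic distribution, its word ids, and its short texts. The point of the sketch is the dependency structure: a short text's formal topic is tied to its parent document through `theta`, while its informal topic is drawn independently of the document.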
