Efficient Topic Modeling on Phrases via Sparsity

Topic modeling on phrases helps in understanding documents by providing interpretable topics. However, existing methods are less efficient than topic modeling methods on words, which may limit their potential applications. Towards a more efficient method, we propose a novel topic model, SparseTP, which (1) models words and phrases by linking them in a Markov Random Field when necessary; (2) provides a well-formed lower bound of the model for Gibbs sampling; and (3) exploits the sparse distribution of words and phrases over topics to speed up inference. Experiments demonstrate that it achieves high efficiency without sacrificing effectiveness.
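To make the sparsity idea concrete, the following is a minimal sketch of a sparsity-aware Gibbs sampling step for a plain word-level LDA model. It is not the SparseTP sampler, which is not specified in this abstract; it only illustrates how the collapsed conditional can be split into buckets so that most of the work touches only topics with nonzero counts. All names (`dense_weights`, `sparse_buckets`, `sample_topic`) and the dictionary-based count layout are assumptions made for this illustration.

```python
import random

def dense_weights(n_dk, n_wk, n_k, alpha, beta, V):
    """Unnormalized collapsed-Gibbs weight for every topic (O(K) work)."""
    return [
        (n_dk.get(k, 0) + alpha) * (n_wk.get(k, 0) + beta) / (n_k[k] + V * beta)
        for k in range(len(n_k))
    ]

def sparse_buckets(n_dk, n_wk, n_k, alpha, beta, V):
    """Split the same weights into three buckets.

    n_dk: {topic: count} for the current document (sparse, assumed layout)
    n_wk: {topic: count} for the current word (sparse, assumed layout)
    n_k:  list of total token counts per topic
    """
    denom = [c + V * beta for c in n_k]
    # Smoothing bucket: dense, but independent of document and word,
    # so a real sampler would cache it and update it incrementally.
    s = {k: alpha * beta / denom[k] for k in range(len(n_k))}
    # Document bucket: only topics that occur in this document.
    r = {k: c * beta / denom[k] for k, c in n_dk.items()}
    # Word bucket: only topics that this word has been assigned to.
    q = {k: (alpha + n_dk.get(k, 0)) * c / denom[k] for k, c in n_wk.items()}
    return s, r, q

def sample_topic(s, r, q, rng):
    """Draw a topic proportionally to the summed bucket weights."""
    total = sum(s.values()) + sum(r.values()) + sum(q.values())
    u = rng.random() * total
    last = None
    # Visit the word bucket first: it usually holds most of the mass.
    for bucket in (q, r, s):
        for k, w in bucket.items():
            last = k
            u -= w
            if u <= 0:
                return k
    return last  # guard against floating-point round-off

if __name__ == "__main__":
    n_k = [10, 5, 8, 7]      # toy topic totals
    n_dk = {0: 3}            # document uses topic 0 only
    n_wk = {0: 2, 2: 1}      # word seen under topics 0 and 2
    s, r, q = sparse_buckets(n_dk, n_wk, n_k, alpha=0.1, beta=0.01, V=50)
    print(sample_topic(s, r, q, random.Random(0)))
```

For each topic, the three bucket entries add up to the dense weight, so the sampled distribution is unchanged; the saving comes from iterating over the nonzero entries of `n_dk` and `n_wk` rather than over all topics.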
