A Pseudo-document-based Topical N-grams model for short texts

In recent years, short text topic modeling has drawn considerable attention from interdisciplinary researchers. Various customized topic models have been proposed to tackle the semantic sparseness of short texts. Most, if not all, of them adopt the bag-of-words assumption, which is inadequate because word order and phrases are often critical to capturing the meaning of a text. On the other hand, while some existing topic models are sensitive to word order, they do not perform well on short texts because of severe data sparsity. To address these issues, we propose the Pseudo-document-based Topical N-Grams model (PTNG), which alleviates the data sparsity problem of short texts while remaining sensitive to word order. Extensive experiments on three real-world data sets against state-of-the-art baselines show that PTNG learns high-quality topics, as measured by UCI coherence scores, and more discriminative semantic representations of short texts, as measured by classification results.
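
For readers unfamiliar with the UCI coherence score mentioned above, the following is a minimal sketch of how such a score can be computed: it averages the pointwise mutual information (PMI) over all pairs of a topic's top words. This is not the evaluation code used for PTNG. The function name `uci_coherence`, the smoothing constant `eps`, and the use of document-level co-occurrence on a small reference corpus are assumptions made for illustration; UCI coherence is usually estimated from sliding-window counts over a large external corpus.

```python
import math
from itertools import combinations


def uci_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words (UCI-style coherence).

    Illustrative simplification: probabilities are estimated from
    document-level co-occurrence in `documents` (a list of token lists)
    rather than from sliding windows over an external corpus.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p1 == 0 or p2 == 0:
            continue  # skip pairs that involve a word unseen in the corpus
        # eps keeps the logarithm finite when the pair never co-occurs.
        scores.append(math.log((p12 + eps) / (p1 * p2)))
    return sum(scores) / len(scores) if scores else 0.0


# Toy reference corpus: "a" and "b" always co-occur, "a" and "c" never do.
docs = [["a", "b"], ["a", "b"], ["c"], ["c"]]
coherent = uci_coherence(["a", "b"], docs)
incoherent = uci_coherence(["a", "c"], docs)
```

Higher scores indicate that a topic's top words tend to appear together in the reference corpus, which is why the measure is used as a proxy for topic quality.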
