A time-series based aggregation scheme for topic detection in Weibo short texts

Abstract Discovering hot topics in social networks such as Twitter and Weibo has received much attention in recent years. While topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to topic discovery, the topics they produce are often less coherent when the input is microblog content, that is, short “posts”. In this paper, we propose a time-series based aggregation scheme for topic modeling in Weibo. Because Weibo topics are coherent within a time slice, we divide the Weibo dataset into groups by time slice. The posts in each group are then aggregated into several longer pseudo-documents using paragraph-vector based similarity algorithms. Applying this scheme before training an LDA model substantially decreases the topic model perplexity and increases the clustering quality, which in turn allows for better discovery of the underlying topics in Weibo. Furthermore, other topic models that extend LDA can be applied directly to such short texts.
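The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract specifies paragraph-vector similarity, whereas this sketch substitutes cosine similarity over raw token counts so that it runs with the standard library alone. The greedy grouping rule, the function names `cosine` and `aggregate_by_time_slice`, and the parameters `slice_seconds` and `threshold` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aggregate_by_time_slice(posts, slice_seconds=3600, threshold=0.3):
    """Group posts by time slice, then merge similar posts in each slice.

    posts: iterable of (timestamp_in_seconds, list_of_tokens).
    Returns a list of pseudo-documents, each a list of tokens.

    Assumption: token-count cosine similarity stands in for the
    paragraph-vector similarity named in the abstract, and the greedy
    merge rule below is a hypothetical choice, not the paper's algorithm.
    """
    # Step 1: split the dataset into groups by time slice.
    slices = defaultdict(list)
    for timestamp, tokens in posts:
        slices[timestamp // slice_seconds].append(tokens)

    pseudo_docs = []
    for key in sorted(slices):
        groups = []  # each entry: [count vector, merged token list]
        for tokens in slices[key]:
            vec = Counter(tokens)
            best, best_sim = None, threshold
            # Step 2: find the most similar pseudo-document in this slice.
            for group in groups:
                sim = cosine(vec, group[0])
                if sim >= best_sim:
                    best, best_sim = group, sim
            if best is None:
                groups.append([vec, list(tokens)])  # start a new pseudo-document
            else:
                best[0].update(vec)     # add the post's counts to the group
                best[1].extend(tokens)  # append the post's tokens
        pseudo_docs.extend(group[1] for group in groups)
    return pseudo_docs
```

The resulting token lists would then be passed to an LDA implementation in place of the individual posts. Swapping `cosine` for a similarity computed on learned paragraph vectors would bring the sketch closer to the scheme the abstract describes.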
