Improved Topic Modeling in Twitter Through Community Pooling

Social networks play a fundamental role in the propagation of information and news. Characterizing the content of messages is vital for tasks such as breaking news detection, personalized message recommendation, fake user detection, and information flow characterization. However, Twitter posts are short and often less coherent than other text documents, which makes it challenging to apply text mining algorithms to these datasets efficiently. Tweet pooling (aggregating tweets into longer documents) has been shown to improve automatic topic decomposition, but the performance achieved varies with the pooling method. In this paper, we propose a new pooling scheme for topic modeling in Twitter, which groups tweets whose authors belong to the same community (a group of users who interact mainly with each other and not with other groups) on a user interaction graph. We present a complete evaluation of this methodology, state-of-the-art schemes, and previous pooling models in terms of cluster quality, document retrieval performance, and supervised machine learning classification score. Results show that our community pooling method outperformed the other methods on the majority of metrics on two heterogeneous datasets, while also reducing the running time. This is useful when dealing with large volumes of noisy, short user-generated social media text. Overall, our findings contribute to an improved methodology for identifying the latent topics in a Twitter dataset without the need to modify the basic machinery of a topic decomposition model.
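The pooling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy users, and the toy tweets are all hypothetical, and connected components of the interaction graph are used as a crude stand-in for a real community detection algorithm (a modularity-based method, for instance, would normally take that role). The sketch only shows how tweets might be merged into one document per community before being passed to an unmodified topic model.

```python
from collections import defaultdict


def find_communities(edges):
    """Map each user to a community id.

    ASSUMPTION: connected components (via union-find) stand in for a
    proper community detection algorithm on the interaction graph.
    """
    parent = {}

    def find(user):
        parent.setdefault(user, user)
        while parent[user] != user:
            parent[user] = parent[parent[user]]  # path halving
            user = parent[user]
        return user

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return {user: find(user) for user in list(parent)}


def pool_by_community(tweets, edges):
    """Concatenate tweets whose authors share a community.

    tweets: iterable of (author, text) pairs.
    edges:  iterable of (user, user) interactions (e.g. retweets, mentions).
    Returns a dict mapping community id -> pooled document.
    """
    community_of = find_communities(edges)
    pools = defaultdict(list)
    for author, text in tweets:
        # Authors missing from the graph get a singleton pool of their own.
        community_id = community_of.get(author, author)
        pools[community_id].append(text)
    return {cid: " ".join(texts) for cid, texts in pools.items()}


# Hypothetical toy data: two interaction clusters and one isolated user.
edges = [("ana", "bob"), ("bob", "cy"), ("dee", "eli")]
tweets = [
    ("ana", "budget vote today"),
    ("cy", "senate budget debate"),
    ("dee", "derby final tonight"),
    ("eli", "what a goal"),
    ("zed", "hello world"),
]
pools = pool_by_community(tweets, edges)
```

Each value in `pools` is a longer pseudo-document that could then be fed to any standard topic model in place of the individual tweets, which is the sense in which pooling leaves the topic decomposition machinery untouched.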
