Twitter Network Mimicking for Data Storage Benchmarking

A significant portion of the textual data available on the Web comes from microblogging services such as Twitter. A considerable body of research has hence investigated methods for processing streams of short texts from such services, as well as for benchmarking online social networks. However, the cost of acquiring real microblogs is prohibitive for researchers with limited resources. We address this challenge by proposing TWIG, a benchmark generator for microblogging services similar to Twitter. It is a collection of algorithms to: 1) serialize users and their tweets from Twitter in RDF, 2) analyze the serialized RDF data to approximate distributions over the underlying social network, and 3) mimic a social network by generating synthetic users and tweets based on the approximated distributions. By using TWIG-generated data, researchers can carry out preliminary evaluations of social network analysis and NLP approaches at low cost. Experimental and human evaluation results suggest that the synthetic tweets generated by TWIG are hardly distinguishable from human-generated tweets. Moreover, our results support the scalability of our approach. Our Java implementation of TWIG is open-source and can be found at: https://github.com/dice-group/TWIG.
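
To make steps 2 and 3 of the pipeline more concrete, the following is a minimal Java sketch of the general idea: approximate an empirical distribution from observed data, then sample synthetic values from it. It is not taken from the TWIG code base. The class name `TwigMimicSketch`, its methods, and the toy tweets-per-user counts are hypothetical, and the actual distributions and sampling strategy used by TWIG may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the TWIG implementation.
class TwigMimicSketch {

    /** Step 2 (sketch): count how often each value occurs in the observed data. */
    public static TreeMap<Integer, Integer> histogram(int[] observed) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int value : observed) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Step 3 (sketch): draw one value with probability proportional to its count.
     * Assumes a non-empty histogram; Random.nextInt rejects a bound of 0.
     */
    public static int sample(TreeMap<Integer, Integer> counts, Random rng) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        int target = rng.nextInt(total); // uniform in [0, total)
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            target -= entry.getValue();
            if (target < 0) {
                return entry.getKey();
            }
        }
        throw new IllegalStateException("unreachable for a non-empty histogram");
    }

    /** Generate n synthetic values that follow the observed empirical distribution. */
    public static List<Integer> generate(int[] observed, int n, long seed) {
        TreeMap<Integer, Integer> counts = histogram(observed);
        Random rng = new Random(seed); // fixed seed keeps runs reproducible
        List<Integer> synthetic = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            synthetic.add(sample(counts, rng));
        }
        return synthetic;
    }

    public static void main(String[] args) {
        // Toy input (assumed): number of tweets observed for seven users.
        int[] tweetsPerUser = {1, 1, 2, 3, 3, 3, 8};
        System.out.println(generate(tweetsPerUser, 5, 42L));
    }
}
```

In this sketch, every synthetic value is one that was actually observed, and more frequent values are drawn more often. A generator in the spirit of the abstract would apply the same idea to several properties of the network, for example follower counts or word sequences in tweets.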
