Synthetic Social Media Data Generation

This paper presents a novel system, the synthetic high-fidelity social media data generator (SHIELD), for generating synthetic social media data. SHIELD jointly generates time-varying, directed, and weighted interaction graph structures and topic-driven text features similar to those of the input social media data. A synthetic interaction graph is generated by a social network model to minimize the distance to the real graph and is enhanced by adding various patterns, such as anomalies and information cascades, interaction types, and temporal dynamics. A synthetic text generator based on the $n$-gram Markov model is trained for each topic identified by topic modeling. Synthetic text and graph structures are combined through the assignment of synthetic social media entities. Extensive performance evaluation via graph and text analysis is provided to demonstrate the statistical fidelity of large-scale synthetic data generated by SHIELD. A data evaluation exercise with human participants is conducted to assess how difficult it is for a human to distinguish tweets generated by SHIELD from tweets posted by real users. Experimental results, followed by a statistical significance analysis, show that human participants cannot reliably distinguish between real and synthetic tweets.
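
The per-topic $n$-gram Markov text generator can be illustrated with a minimal Python sketch. This is not SHIELD's implementation: the function names `train_ngram_model` and `generate`, the boundary tokens `<s>` and `</s>`, whitespace tokenization, and the toy corpus are all assumptions made for illustration, and the topic labels are taken as given here rather than inferred by topic modeling.

```python
import random
from collections import defaultdict


def train_ngram_model(texts, n=2):
    """Map each (n-1)-token context to the list of tokens observed after it.

    Hypothetical helper: repeated followers are kept so that sampling
    uniformly from the list follows the empirical n-gram frequencies.
    """
    model = defaultdict(list)
    for text in texts:
        # Pad with start markers and close with an end marker (assumed tokens).
        tokens = ["<s>"] * (n - 1) + text.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context].append(tokens[i + n - 1])
    return model


def generate(model, n=2, max_tokens=30, rng=None):
    """Sample one synthetic message by walking the Markov chain."""
    rng = rng or random.Random()
    context = ("<s>",) * (n - 1)
    output = []
    for _ in range(max_tokens):
        followers = model.get(context)
        if not followers:
            break  # unseen context: stop rather than invent a token
        token = rng.choice(followers)
        if token == "</s>":
            break
        output.append(token)
        # Slide the context window forward by one token.
        context = (context + (token,))[1:]
    return " ".join(output)


# Toy corpus with topic labels assumed to be already assigned.
corpus_by_topic = {
    "sports": ["the team won the game", "the game was close"],
    "weather": ["rain is expected today", "sun is expected tomorrow"],
}

# One model per topic, mirroring the per-topic training described above.
topic_models = {
    topic: train_ngram_model(texts, n=2)
    for topic, texts in corpus_by_topic.items()
}

if __name__ == "__main__":
    print(generate(topic_models["sports"], n=2, rng=random.Random(0)))
```

Training a separate model per topic keeps each chain's vocabulary and transitions topic-specific, so a message sampled from the `sports` model in this sketch contains only words seen in the `sports` texts.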
