Data Set Synthesis Based on Known Correlations and Distributions for Expanded Social Graph Generation

Data generated through the use of various services are rarely available to the average researcher. Security and privacy have become top concerns, further restricting access to real-life data, particularly data from social networks. Synthetic data generators, and synthetic social graph generators in particular, have therefore become an important area of research. However, most such generators still create graphs that record only whether two nodes are connected. An existing conceptual solution for an expanded social graph generator addresses this limitation: it aims to generate synthetic graphs with multiple weighted edges between nodes, representing various types of relationships among them, all based on known characteristics of real-life data. One of its proposed steps is the generation of the necessary data according to provided distributions and correlations. This paper focuses on that step, adapting an existing iterative algorithm for non-normal multivariate data simulation to generate synthetic data based on publicly available distributions and correlations of Facebook interaction parameters. The characteristics of the generated synthetic data are shown to be similar to the known characteristics of the real-life data, indicating that the chosen algorithm, together with the accompanying alterations, can be used as one of the steps in generating a synthetic expanded social graph.
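To make the idea of generating data from given distributions and correlations concrete, the following is a minimal Python sketch of one common iterative scheme: draw correlated normal scores, map them by rank onto the desired marginal values, compare the resulting correlation matrix with the target, and adjust the intermediate correlations by the residual. It is an illustration under stated assumptions, not the algorithm as adapted in this paper. The function names `simulate` and `nearest_psd_corr`, the Cholesky-based generation of normal scores, and the fixed iteration count are all choices made for this sketch.

```python
import numpy as np


def nearest_psd_corr(c, eps=1e-6):
    """Repair a symmetric matrix into a positive-definite correlation matrix
    by clipping small or negative eigenvalues and rescaling to a unit diagonal."""
    w, v = np.linalg.eigh((c + c.T) / 2)
    w = np.clip(w, eps, None)
    c = (v * w) @ v.T
    d = np.sqrt(np.diag(c))
    return c / np.outer(d, d)


def simulate(marginals, target_corr, n_iter=50, seed=0):
    """Hypothetical helper: return data whose columns reuse the supplied
    marginal values exactly and whose Pearson correlations approximate
    target_corr, together with the RMS residual of the best iteration."""
    rng = np.random.default_rng(seed)
    marginals = [np.sort(np.asarray(m, dtype=float)) for m in marginals]
    n, k = len(marginals[0]), len(marginals)
    target = np.asarray(target_corr, dtype=float)
    z = rng.standard_normal((n, k))  # fixed noise keeps iterations comparable
    inter = target.copy()            # intermediate correlations, tuned below
    best_err, best = np.inf, None
    for _ in range(n_iter):
        # Correlated normal scores from the current intermediate matrix.
        chol = np.linalg.cholesky(nearest_psd_corr(inter))
        normal = z @ chol.T
        # Rank-map each column onto the sorted target marginal values.
        ranks = normal.argsort(axis=0).argsort(axis=0)
        data = np.column_stack([marginals[j][ranks[:, j]] for j in range(k)])
        # Residual between target and reproduced correlations.
        resid = target - np.corrcoef(data, rowvar=False)
        err = np.sqrt(np.mean(resid[np.triu_indices(k, 1)] ** 2))
        if err < best_err:
            best_err, best = err, data
        # Shift the intermediate correlations by the residual and retry.
        inter = inter + resid
        np.fill_diagonal(inter, 1.0)
    return best, best_err
```

In this sketch each entry of `marginals` is a sample already drawn from the desired distribution of one parameter, so the marginal values are reproduced exactly and only their pairing across columns is tuned. Target correlations that are unattainable for the given marginals simply leave a larger residual in the returned error.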
