SYNTHG: Mimicking RDF Graphs Using Tensor Factorization

Synthetic graphs are needed to support benchmarking efforts. Synthetic graphs that mimic real-world graphs can be used to avoid sending sensitive information to third parties while preserving the topological characteristics of the original graph. They can also be used to evaluate the scalability of different algorithms, since the size of synthetic graphs can be scaled. In view of these applications, we introduce a novel approach to mimic RDF graphs. Our approach introduces a random rotation matrix into the tensor factorization of the input RDF graph. By combining this matrix with the core tensor computed by the factorization, our approach can generate a graph that maintains the querying characteristics of the input graph while not permitting a reconstruction of the input graph. We use Semantic Web Dog Food and DBpedia 2016 to evaluate our approach and compare the original, reconstructed and synthetic graphs by using them to benchmark five triple stores. The results show that the Pearson correlation between the performance of the triple stores under the original and synthetic graphs is 0.91 for Semantic Web Dog Food and 0.64 for DBpedia. Our results also suggest that the synthetic graphs inherit the main graph characteristics of the original graphs. SynthG is open-source and is available at: https://github.com/dice-group/SynthG
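To make the idea of rotating a tensor factorization concrete, the following is a minimal sketch and not the SynthG implementation. It assumes the RDF graph is encoded as one binary adjacency matrix per predicate, uses a simplified SVD-based stand-in for a RESCAL-style factorization (each slice approximated as A R_k A^T), and assumes that a synthetic slice is obtained by keeping as many top-scoring entries as the original slice has triples. The function names `factorize`, `random_rotation` and `synthesize` are hypothetical and chosen only for illustration.

```python
import numpy as np


def factorize(slices, rank):
    """Simplified RESCAL-style factorization: X_k ~ A @ R_k @ A.T.

    Assumption: A is taken from a truncated SVD of the stacked slices
    (not the alternating least squares procedure of RESCAL itself).
    """
    # Stack every slice and its transpose so A reflects subject and object roles.
    stacked = np.hstack([m for x in slices for m in (x, x.T)])
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    a = u[:, :rank]  # orthonormal columns
    # For orthonormal A, the least-squares core slice is A.T @ X_k @ A.
    core = [a.T @ x @ a for x in slices]
    return a, core


def random_rotation(rank, rng):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((rank, rank)))
    # Flip column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))


def synthesize(a, core, slices, rng):
    """Combine a randomly rotated factor matrix with the core tensor."""
    a_rot = a @ random_rotation(a.shape[1], rng)
    synthetic = []
    for x, r_k in zip(slices, core):
        scores = a_rot @ r_k @ a_rot.T
        n_triples = int(x.sum())
        out = np.zeros_like(x)
        if n_triples > 0:
            # Assumption: keep the same number of triples as the original slice.
            top = np.argpartition(scores.ravel(), -n_triples)[-n_triples:]
            out.flat[top] = 1
        synthetic.append(out)
    return synthetic


# Toy input: 6 entities, 2 predicates, random binary adjacency slices.
rng = np.random.default_rng(0)
original = [(rng.random((6, 6)) < 0.3).astype(float) for _ in range(2)]
a, core = factorize(original, rank=3)
synthetic = synthesize(a, core, original, rng)
```

In this sketch the core tensor carries the predicate-level interaction patterns, while the rotation changes which entities those patterns are attached to; whether that matches the exact construction used by SynthG should be checked against the repository linked above.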
