TrillionG: A Trillion-scale Synthetic Graph Generator using a Recursive Vector Model

As many applications encounter exponential growth in graph sizes, a fast and scalable graph generator has become more important than ever, because large-scale realistic graphs for evaluating the performance of graph processing methods are scarce. Although a number of methods have been proposed to generate synthetic graphs, they are not very efficient in terms of space and time complexity, and so cannot generate trillion-scale graphs using a moderate-size cluster of commodity machines. Here, we propose an efficient and scalable disk-based graph generator, TrillionG, that can generate massive graphs in a short time using only a small amount of memory. It can generate a graph of a trillion edges following the RMAT or Kronecker models within two hours using only 10 PCs. We first generalize existing graph generation models to the scope-based generation model, in which RMAT and Kronecker correspond to the two extremes. Then, we propose a new graph generation model called the recursive vector model, which strikes a compromise between the two extremes and so solves the space and time complexity problems of RMAT and Kronecker. We also extend the recursive vector model to generate a semantically richer graph database. Through extensive experiments, we demonstrate that TrillionG outperforms the state-of-the-art graph generators by up to orders of magnitude.
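
For readers unfamiliar with the RMAT model mentioned above, the following is a minimal sketch of the standard per-edge RMAT procedure: each edge is placed by recursively choosing one of four quadrants of the adjacency matrix, one bit of the source and destination IDs per level. It is not TrillionG's recursive vector model, only an illustration of the baseline the abstract refers to. The function names (`rmat_edge`, `rmat_edges`) and the default quadrant probabilities (0.57, 0.19, 0.19, 0.05) are assumptions chosen for illustration and do not come from this abstract.

```python
import random


def rmat_edge(scale, probs=(0.57, 0.19, 0.19, 0.05), rng=random):
    """Sample one edge of a graph with 2**scale vertices (illustrative sketch).

    probs = (a, b, c, d) are the quadrant probabilities; the defaults are
    an assumption for illustration, not values taken from the abstract.
    """
    a, b, c, _d = probs
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        # Make room for the next bit of each vertex ID.
        src <<= 1
        dst <<= 1
        if r < a:
            pass                # top-left quadrant: both bits stay 0
        elif r < a + b:
            dst |= 1            # top-right quadrant
        elif r < a + b + c:
            src |= 1            # bottom-left quadrant
        else:
            src |= 1            # bottom-right quadrant
            dst |= 1
    return src, dst


def rmat_edges(scale, num_edges, seed=0):
    """Generate num_edges edges (duplicates are possible) with a fixed seed."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng=rng) for _ in range(num_edges)]
```

In this sketch every edge costs `scale` random draws and is generated independently, so duplicate edges can occur and would have to be removed in a separate step.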
