GBASE: a scalable and general graph management system

Graphs appear in numerous applications including cyber-security, the Internet, social networks, protein networks, recommendation systems, and many more. Graphs with millions or even billions of nodes and edges are commonplace. How can such large graphs be stored efficiently? What are the core operations and queries on those graphs? How can graph queries be answered quickly? We propose GBASE, a scalable and general graph management and mining system. The key novelties lie in 1) our storage and compression scheme for a parallel setting and 2) the carefully chosen graph operations and their efficient implementation. We designed and implemented an instance of GBASE using MapReduce/Hadoop. GBASE provides a parallel indexing mechanism for graph mining operations that both saves storage space and accelerates queries. We ran numerous experiments on real graphs, spanning billions of nodes and edges, and we show that our proposed GBASE is indeed fast, scalable and nimble, with significant savings in space and time.
