EmptyHeaded: A Relational Engine for Graph Processing

High-performance graph processing engines fall into two types: low-level and high-level. Low-level engines (Galois, PowerGraph, SNAP) provide optimized data structures and computation models but require users to write low-level imperative code, which places the burden of efficiency on the user. In high-level engines, users write in query languages such as Datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich Datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded's design is a new class of join algorithms that satisfy strong theoretical guarantees but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction, multiple-data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP), and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3x worse performance on SSSP.
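To make the join-based view of graph pattern queries concrete, the following is a minimal Python sketch of evaluating the triangle pattern one attribute at a time by intersecting sorted neighbor lists. It is an illustration only, not EmptyHeaded's implementation: the function names (`build_index`, `intersect_sorted`, `triangles`) are hypothetical, the edge orientation `a < b` is an assumption chosen so that each triangle is reported once, and the scalar merge intersection stands in for the SIMD set-intersection kernels that such an engine would rely on.

```python
from collections import defaultdict


def build_index(edges):
    """Orient each undirected edge as (a, b) with a < b and map a -> sorted neighbors."""
    adj = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        a, b = (u, v) if u < v else (v, u)
        adj[a].add(b)
    return {a: sorted(bs) for a, bs in adj.items()}


def intersect_sorted(xs, ys):
    """Merge-style intersection of two sorted lists (scalar stand-in for a SIMD kernel)."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


def triangles(edges):
    """Evaluate Triangle(a, b, c) :- E(a, b), E(b, c), E(a, c) attribute by attribute."""
    index = build_index(edges)
    result = []
    for a, a_nbrs in index.items():          # bind a
        for b in a_nbrs:                     # bind b from E(a, b)
            b_nbrs = index.get(b, [])
            # bind c: it must appear in both E(a, c) and E(b, c)
            for c in intersect_sorted(a_nbrs, b_nbrs):
                result.append((a, b, c))
    return result
```

For example, `triangles([(1, 2), (2, 3), (1, 3), (3, 4)])` finds the single triangle on vertices 1, 2, and 3. The point of the sketch is that the innermost step is a set intersection rather than a pairwise join followed by a filter, which is why the data layout and intersection speed matter so much for this style of engine.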
