EmptyHeaded: A Relational Engine for Graph Processing

There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, which places the burden of efficiency on the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded's design is a new class of join algorithms that satisfy strong theoretical guarantees but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and an execution engine that leverages single-instruction, multiple-data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP), and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3× worse performance on SSSP. Finally, we show that the EmptyHeaded design can easily be extended to accommodate a standard Resource Description Framework (RDF) workload, the LUBM benchmark. On the LUBM benchmark, we show that EmptyHeaded can compete with and sometimes outperform two high-level but specialized RDF baselines (TripleBit and RDF-3X), while outperforming MonetDB by up to three orders of magnitude and LogicBlox by up to two orders of magnitude.
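To make the idea of a graph pattern query evaluated as a join more concrete, the following is a minimal sketch of triangle counting expressed as intersections of sorted neighbor lists. It is an illustration only and not EmptyHeaded's implementation: the function names `intersect_sorted` and `count_triangles`, the edge-list input format, and the scalar merge loop are all assumptions made for this example. An engine of the kind described above would presumably run the intersection step with SIMD instructions rather than one comparison at a time.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists of integers."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs.

    Each edge is stored once, from the smaller to the larger vertex id,
    so every triangle x < y < z is found exactly once.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        lo, hi = (u, v) if u < v else (v, u)
        adj.setdefault(lo, set()).add(hi)

    # Sorted neighbor lists play the role of the sorted sets being joined.
    sorted_adj = {u: sorted(vs) for u, vs in adj.items()}

    total = 0
    for u, nbrs in sorted_adj.items():
        for v in nbrs:
            # Common larger neighbors of u and v each close one triangle.
            total += len(intersect_sorted(nbrs, sorted_adj.get(v, [])))
    return total


if __name__ == "__main__":
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(count_triangles(k4))
```

The sketch fixes the vertex order by id and intersects neighbor lists one attribute at a time. That is the general shape of attribute-at-a-time join evaluation, but the choice of attribute order and set layout here is arbitrary rather than optimized.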
