Datalography: Scaling Datalog graph analytics on graph processing systems

This paper presents the first Datalog evaluation engine for executing graph analytics over BSP-style graph processing engines. Recent advances in Datalog support efficient evaluation of aggregate functions, which makes it easy for data scientists to author many important graph algorithms succinctly. Freed from the burden of low-level parallelization and optimization, data scientists can avoid programming to the quirks of the latest high-performance distributed computing framework. Where prior approaches build bespoke evaluation engines or modify generalized dataflow processing engines to achieve performance, this work shows how to efficiently evaluate Datalog directly on BSP-style graph processing engines such as Giraph. Datalography incorporates both traditional Datalog optimizations, such as semi-naive evaluation, and new evaluation algorithms and optimization techniques for efficient distributed evaluation of Datalog queries on graph processing engines. In particular, we develop evaluation techniques that take advantage of super vertices, eager aggregation, and asynchronous execution to optimize graph processing on Pregel-like systems. We implement our algorithms on top of Apache Giraph, and our results indicate that Datalography competes with native, tuned implementations, with some analytics running up to 9 times faster.
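
For readers unfamiliar with semi-naive evaluation, the following minimal single-machine sketch illustrates the general idea on the textbook reachability program `path(x, y) :- edge(x, y).` and `path(x, z) :- path(x, y), edge(y, z).` It is not Datalography's distributed implementation; the function name `transitive_closure_semi_naive` and the in-memory set representation are assumptions made purely for illustration. The point is that each round joins only the facts that were new in the previous round (`delta`) against `edge`, rather than re-deriving everything from the full `path` relation.

```python
def transitive_closure_semi_naive(edges):
    """Semi-naive evaluation of a reachability program (illustrative only).

    path(x, y) :- edge(x, y).
    path(x, z) :- path(x, y), edge(y, z).
    """
    edges = set(edges)

    # Index edge(y, z) by y so the join with path(x, y) is a lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    path = set(edges)   # every path fact derived so far
    delta = set(edges)  # facts that were new in the previous round

    while delta:
        # Join only the new facts against edge, not the whole path relation.
        derived = {(x, z) for (x, y) in delta for z in successors.get(y, ())}
        # Keep only facts not seen before; an empty delta is the fixpoint.
        delta = derived - path
        path |= delta

    return path
```

A call such as `transitive_closure_semi_naive({(1, 2), (2, 3)})` returns the two input edges plus the derived fact `(1, 3)`. On cyclic inputs the loop still terminates, because `delta` becomes empty once no unseen fact can be derived.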
