From NoSQL Accumulo to NewSQL Graphulo: Design and utility of graph algorithms inside a BigTable database

Google BigTable's scale-out design for distributed key-value storage inspired a generation of NoSQL databases. More recently, the NewSQL paradigm emerged in response to analytic workloads that demand distributed computation local to data storage. Many such analytics take the form of graph algorithms, a trend that motivated the GraphBLAS initiative to standardize a set of matrix math kernels for building graph algorithms. In this article we show how to implement the GraphBLAS kernels in a BigTable database by presenting the design of Graphulo, a library for executing graph algorithms inside the Apache Accumulo database. We detail the Graphulo implementation of two graph algorithms and conduct experiments comparing their performance to that of two main-memory matrix math systems. Our results shed light on the conditions that determine when executing a graph algorithm is faster inside a database than in an external system. In short, memory requirements and relative I/O are the critical factors.
