All-in-One: Graph Processing in RDBMSs Revisited

To support analytics on massive graphs such as online social networks, RDF data, and the Semantic Web, many new graph algorithms have been designed to query graphs for specific problems, and many distributed graph processing systems have been developed to support graph querying by programming. In this paper, we focus on the RDBMS, which has been well studied over decades for managing large datasets, and we revisit the issue of how an RDBMS can support graph processing at the SQL level. Our work is motivated by two facts: in real applications, many relations stored in an RDBMS are closely related to a graph and need to be used together with it when querying, and an RDBMS can query and manage data while the data are updated over time. To support graph processing, we propose 4 new relational algebra operations: MM-join, MV-join, anti-join, and union-by-update. MM-join and MV-join are join operations between two matrices and between a matrix and a vector, respectively, followed by aggregation over groups, given that a matrix or vector can be represented by a relation. Both operate over a semiring, by which many graph algorithms can be supported. The anti-join removes nodes/edges from a graph when they are unnecessary for the subsequent computation. The union-by-update handles value updates, as needed to compute PageRank, for example. The 4 new relational algebra operations can be defined using the 6 basic relational algebra operations together with group-by and aggregation. We revisit SQL recursive queries and show that the 4 operations, together with others, are ensured to have a fixpoint, following the techniques studied in DATALOG, and we enhance the recursive WITH clause in SQL'99. We conduct extensive performance studies to test 10 graph algorithms using 9 large real graphs in 3 major RDBMSs. We show that RDBMSs are capable of dealing with graph processing in reasonable time. The focus of this work is at the SQL level. There is high potential to improve efficiency through main-memory RDBMSs, efficient parallel join processing, and new storage management.
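
As a rough illustration of the operations named above, the following sketch expresses an MV-join, an anti-join, a union-by-update, and a recursive WITH query in plain SQL, run through Python's standard sqlite3 module. The schema (an edge relation E(F, T, ew) standing in for a matrix and a node relation V(ID, vw) standing in for a vector), the join direction, the function names, and the toy data are all assumptions made for this example; they are not taken from the paper, whose exact definitions may differ. The (+, *) semiring is used here, and other semirings would swap the aggregate and the product.

```python
import sqlite3


def build_example():
    """Create a tiny in-memory graph (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE E (F INTEGER, T INTEGER, ew REAL)")
    conn.execute("CREATE TABLE V (ID INTEGER PRIMARY KEY, vw REAL)")
    conn.executemany(
        "INSERT INTO E VALUES (?, ?, ?)",
        [(1, 2, 0.5), (1, 3, 0.5), (2, 3, 1.0), (3, 1, 1.0)],
    )
    conn.executemany(
        "INSERT INTO V VALUES (?, ?)",
        [(1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)],
    )
    return conn


def mv_join(conn):
    """Matrix-vector join: join E with V, then aggregate per target node."""
    rows = conn.execute(
        "SELECT E.T, SUM(E.ew * V.vw) "
        "FROM E JOIN V ON E.F = V.ID "
        "GROUP BY E.T"
    ).fetchall()
    return dict(rows)


def anti_join(conn):
    """Nodes in V that have no matching incoming edge in E."""
    rows = conn.execute(
        "SELECT V.ID FROM V "
        "WHERE NOT EXISTS (SELECT 1 FROM E WHERE E.T = V.ID) "
        "ORDER BY V.ID"
    ).fetchall()
    return [r[0] for r in rows]


def union_by_update(conn, new_values):
    """Overwrite vw for IDs already in V; insert rows for unseen IDs."""
    for node, value in new_values.items():
        cur = conn.execute("UPDATE V SET vw = ? WHERE ID = ?", (value, node))
        if cur.rowcount == 0:
            conn.execute("INSERT INTO V (ID, vw) VALUES (?, ?)", (node, value))


def reachable(conn, source):
    """Nodes reachable from source, via a recursive WITH query.

    UNION (not UNION ALL) removes duplicates, so the recursion stops
    once no new node is produced, i.e. at a fixpoint.
    """
    rows = conn.execute(
        "WITH RECURSIVE R(ID) AS ("
        "  SELECT T FROM E WHERE F = ? "
        "  UNION "
        "  SELECT E.T FROM E JOIN R ON E.F = R.ID"
        ") SELECT ID FROM R ORDER BY ID",
        (source,),
    ).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    conn = build_example()
    step = mv_join(conn)          # one propagation step over the graph
    print(step)
    print(anti_join(conn))        # nodes with no incoming edge
    union_by_update(conn, step)   # write the new values back into V
    print(reachable(conn, 1))
```

Repeating the `mv_join` and `union_by_update` pair until the values stop changing is the general shape of an iterative computation such as PageRank, although a real implementation would also need damping and a convergence test, which this sketch omits.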
