Scalable parallel graph algorithms with matrix–vector multiplication evaluated with queries

Graph problems are significantly harder to solve when large graphs reside on disk than when they fit in main memory. In this work, we study how to solve four important graph problems: reachability from a source vertex, single-source shortest path, weakly connected components, and PageRank. It is well known that algorithms for these problems can be expressed as an iteration of matrix–vector multiplications under different semirings. Based on this mathematical foundation, we show how to express the computation with standard relational queries, and we then study how to evaluate them efficiently in parallel on a shared-nothing architecture. We identify a common algorithmic pattern, grounded in sparse matrix–vector multiplication, that unifies the four graph algorithms. The net gain is that our SQL-based approach enables solving “big data” graph problems on parallel database systems, debunking the common wisdom that they are cumbersome and slow. Using real data sets of large social networks and hyperlink graphs, we present performance comparisons between a columnar DBMS, an open-source array DBMS, and Spark’s GraphX.
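To make the idea of a matrix–vector multiplication evaluated with a relational query concrete, the following is a minimal sketch for single-source shortest path under the (min, +) semiring. It is an illustration only, not the queries used in the paper: the table names `E` and `S`, the column names `i`, `j`, `v`, and the function `sssp_by_queries` are all assumptions, and it runs on an in-memory SQLite database for convenience rather than on a parallel DBMS. Each iteration joins the edge table with the current vector on the shared index, replaces multiplication by `+`, and replaces summation by `MIN`.

```python
import sqlite3


def sssp_by_queries(edges, source):
    """Shortest distances from `source`, computed by iterating a
    (min, +) matrix-vector product written as a join plus GROUP BY.

    `edges` is a list of (i, j, v) tuples: an edge i -> j with weight v.
    All table and column names here are hypothetical.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE E (i INTEGER, j INTEGER, v REAL)")   # sparse matrix
    con.execute("CREATE TABLE S (i INTEGER PRIMARY KEY, v REAL)")  # sparse vector
    con.executemany("INSERT INTO E VALUES (?, ?, ?)", edges)
    con.execute("INSERT INTO S VALUES (?, 0)", (source,))

    # Number of distinct vertices; n - 1 iterations always suffice
    # when all edge weights are non-negative.
    n = con.execute(
        "SELECT COUNT(*) FROM (SELECT i FROM E UNION SELECT j FROM E)"
    ).fetchone()[0]

    for _ in range(max(n - 1, 0)):
        before = con.execute("SELECT i, v FROM S ORDER BY i").fetchall()
        # One semiring product: join on the shared index, "multiply" with +,
        # "add" with MIN. The UNION ALL keeps distances already found.
        con.execute("""
            CREATE TABLE S_next AS
            SELECT i, MIN(v) AS v FROM (
                SELECT E.j AS i, S.v + E.v AS v FROM E JOIN S ON E.i = S.i
                UNION ALL
                SELECT i, v FROM S
            ) GROUP BY i
        """)
        con.execute("DROP TABLE S")
        con.execute("ALTER TABLE S_next RENAME TO S")
        after = con.execute("SELECT i, v FROM S ORDER BY i").fetchall()
        if after == before:  # fixed point reached, stop early
            break

    result = dict(con.execute("SELECT i, v FROM S ORDER BY i").fetchall())
    con.close()
    return result


if __name__ == "__main__":
    toy_edges = [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0), (3, 4, 1.0)]
    print(sssp_by_queries(toy_edges, 1))  # {1: 0.0, 2: 1.0, 3: 3.0, 4: 4.0}
```

Under the same assumptions, swapping the semiring changes the problem while the join and aggregation pattern stays the same. For example, a Boolean-style product (any reachable predecessor marks a vertex as reached) would give reachability from the source, and an ordinary (+, ×) product would give one PageRank-style iteration.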
