Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries

Graph processing has become an important part of many areas of computer science, including machine learning, computational sciences, medical applications, and social network analysis. Many graphs, such as web or social networks, may contain up to trillions of edges. These graphs are often also dynamic (their structure changes over time) and carry rich domain-specific data on vertices and edges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving, and rich datasets. The sheer size of these datasets, combined with the irregular nature of graph processing, poses unique design challenges for such systems. To facilitate the understanding of this emerging domain, we present the first survey and taxonomy of graph database systems. We identify and analyze fundamental categories of these systems (e.g., triple stores, tuple stores, native graph database systems, or object-oriented systems), the associated graph models (e.g., RDF or Labeled Property Graph), data organization techniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspects of data distribution and query execution (e.g., support for sharding and ACID). We present and compare 45 graph database systems, including Neo4j, OrientDB, and Virtuoso. We outline graph database queries and relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms). Finally, we describe research and engineering challenges to outline the future of graph databases.
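
As a rough illustration of the two graph models named above, the following Python sketch holds a tiny Labeled Property Graph in memory and flattens it into RDF-style (subject, predicate, object) triples. The class names, the helper lpg_to_triples, and the sample data are hypothetical and are not taken from any surveyed system; the sketch also assumes that edge properties are simply dropped during flattening, because a plain triple has no slot for them.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    # Hypothetical LPG vertex: an id, a set of labels, and key-value properties.
    vid: str
    labels: set = field(default_factory=set)
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    # Hypothetical LPG edge: directed, with one label and key-value properties.
    src: str
    dst: str
    label: str
    props: dict = field(default_factory=dict)


def lpg_to_triples(vertices, edges):
    """Flatten a small LPG into (subject, predicate, object) triples.

    Assumption: edge properties are dropped, since a plain triple
    cannot attach data to the statement itself.
    """
    triples = []
    for v in vertices:
        for label in sorted(v.labels):
            triples.append((v.vid, "rdf:type", label))
        for key, value in sorted(v.props.items()):
            triples.append((v.vid, key, value))
    for e in edges:
        triples.append((e.src, e.label, e.dst))
    return triples


alice = Vertex("alice", {"Person"}, {"name": "Alice"})
bob = Vertex("bob", {"Person"}, {"name": "Bob"})
knows = Edge("alice", "bob", "knows", {"since": 2015})

triples = lpg_to_triples([alice, bob], [knows])
for t in triples:
    print(t)
```

The edge property ("since") does not survive the flattening, which hints at why the two models are not interchangeable without additional conventions.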
