High-Performance Data Analytics Beyond the Relational and Graph Data Models with GEMS

Graphs are an increasingly popular data model for data analytics, since they can naturally represent relationships and interactions between entities. Relational databases and their pure table-based data model are not well suited to storing and processing sparse data. Consequently, graph databases have gained interest in the last few years, and the Resource Description Framework (RDF) has become the standard data model for graph data. Nevertheless, while RDF is well suited to analyzing the relationships between entities, it is not efficient at representing their attributes and properties. In this work we propose the adoption of a new hybrid data model, based on attributed graphs, that aims to overcome the limitations of the pure relational and graph data models. We present how we have redesigned the GEMS data analytics framework to take full advantage of the proposed hybrid data model. To improve analysts' productivity, in addition to a C++ API for application development, we adopt GraQL as the input query language. We validate our approach by implementing a set of queries on net-flow data, and we compare our framework's performance against Neo4j. Experimental results show significant performance improvements over Neo4j, reaching several orders of magnitude as the size of the input data increases.
