An In-depth Investigation of Large-scale RDF Relational Schema Optimizations Using Spark-SQL

This paper addresses one of the most significant challenges of large-scale RDF data processing over Apache Spark: relational schema optimization. The choice of RDF partitioning technique and storage format in SparkSQL significantly affects query performance. Nevertheless, the trade-offs among different configurations have not been studied intensively in the literature. This paper presents an in-depth investigation that helps practitioners understand these trade-offs and the associated best practices. It also reports on the pitfalls of implementing SPARQL optimizations over SparkSQL. Our experiments provide insights into the relative strengths of these schemas by comparing three different partitioning techniques and four storage formats. Our results give a better understanding of the current state of the art and pave the way for a wide range of best practices and for systematically tuning the performance of distributed systems that handle vast RDF data.
