JOTR: Join-Optimistic Triple Reordering Approach for SPARQL Query Optimization on Big RDF Data

The Resource Description Framework (RDF) is increasingly used to represent information on the web. This popularity has made storing and processing large RDF datasets difficult. To address this, many distributed RDF systems have been proposed that can store and efficiently process Big RDF data, and the Hadoop framework is widely used for this purpose. One of the major obstacles in handling this much RDF data is query processing. In this paper, we present JOTR, a SPARQL query optimization technique for Big RDF data that reorders triple patterns on a distributed, Hadoop-based RDF system. The proposed technique is based on selectivity calculation and has been tested on the LUBM dataset, a popular RDF benchmark. We tested JOTR on large RDF datasets and compared it with other optimization approaches with respect to query execution time. The results indicate that our approach performs notably well on distributed RDF systems and is therefore applicable to centralized systems as well.
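To make the idea of selectivity-based triple pattern reordering concrete, the following is a minimal sketch of a generic greedy heuristic. It is not the JOTR algorithm itself, whose details are not given in this abstract. The function names, the selectivity factors (0.001 for a bound subject, 0.01 for a bound object), and the predicate counts are all hypothetical values chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def is_var(term: str) -> bool:
    """SPARQL variables start with '?'."""
    return term.startswith("?")


def variables(tp: Triple) -> Set[str]:
    """Return the set of variables that appear in a triple pattern."""
    return {t for t in tp if is_var(t)}


def estimate_cardinality(tp: Triple, total: int, pred_counts: Dict[str, int]) -> float:
    """Rough estimate of how many triples match a pattern.

    Assumption: bound subjects and objects shrink the result by fixed,
    made-up factors; a real system would use dataset statistics instead.
    """
    s, p, o = tp
    card = float(total) if is_var(p) else float(pred_counts.get(p, total))
    if not is_var(s):
        card *= 0.001  # hypothetical selectivity of a bound subject
    if not is_var(o):
        card *= 0.01   # hypothetical selectivity of a bound object
    return card


def reorder(patterns: List[Triple], total: int, pred_counts: Dict[str, int]) -> List[Triple]:
    """Greedily order patterns: most selective first, preferring patterns
    that share a variable with those already chosen (avoids cross products)."""
    remaining = list(patterns)
    ordered: List[Triple] = []
    bound: Set[str] = set()
    while remaining:
        connected = [tp for tp in remaining if variables(tp) & bound]
        candidates = connected or remaining
        best = min(candidates, key=lambda tp: estimate_cardinality(tp, total, pred_counts))
        ordered.append(best)
        bound |= variables(best)
        remaining.remove(best)
    return ordered


# Hypothetical LUBM-style query and statistics (illustrative numbers only).
query = [
    ("?x", "ub:takesCourse", "?y"),
    ("?y", "ub:name", "Course0"),
    ("?x", "rdf:type", "ub:GraduateStudent"),
]
stats = {"rdf:type": 1000, "ub:takesCourse": 5000, "ub:name": 2000}

plan = reorder(query, total=10000, pred_counts=stats)
for tp in plan:
    print(tp)
```

Under these assumed statistics, the sketch evaluates the `rdf:type` pattern first because it has the smallest estimate, and then follows shared variables to the remaining patterns, so intermediate results stay small before the joins are executed.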
