P-Spar(k)ql: SPARQL Evaluation Method on Spark GraphX with Parallel Query Plan

Semantic data are built from triples, each consisting of a subject, a predicate and an object. A triple can also be viewed as an edge: the subject and the object are the nodes, and the predicate is the label of the edge. In this view, semantic data define a graph. This graph can be very large, because a semantic dataset can contain millions of triples. Such a dataset can be queried with the SPARQL query language. Since Big Data tools appeared, researchers have tried to evaluate SPARQL queries with them. In the last few years, distributed graph analytics tools have appeared as well. The challenge, then, is to use graph analytics tools to evaluate semantic queries on the semantic graph. In this paper we present PSparkql, which extends Sparkql with a parallel query plan. The system uses the Spark GraphX distributed graph analytics tool. We show that fewer edges are enough for the evaluation than Sparkql uses. We also collect statistics about the graph (number of predicates, data properties) to change the evaluation order of the SPARQL query. We compare our results with related work: Sparkql and S2X.
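
To make the triple-as-edge view and the statistics-based reordering concrete, here is a minimal single-machine sketch in plain Python. It is not the PSparkql implementation and does not use Spark GraphX; the toy triples, the function names (`build_graph`, `predicate_counts`, `order_patterns`) and the "rarest predicate first" heuristic are all assumptions made for illustration.

```python
from collections import Counter

# Toy triples (subject, predicate, object); the data is made up for illustration.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "alice"),
    ("alice", "worksAt", "acme"),
    ("alice", "name", '"Alice"'),
]


def build_graph(triples):
    """View each triple as a labeled edge: subject -> object, label = predicate."""
    nodes = set()
    edges = []
    for s, p, o in triples:
        nodes.update((s, o))
        edges.append((s, o, p))
    return nodes, edges


def predicate_counts(triples):
    """Collect a simple statistic: how many triples use each predicate."""
    return Counter(p for _, p, _ in triples)


def order_patterns(patterns, counts):
    """Hypothetical heuristic (an assumption, not the paper's rule):
    evaluate triple patterns with rarer predicates first."""
    return sorted(patterns, key=lambda pattern: counts.get(pattern[1], 0))


nodes, edges = build_graph(triples)
counts = predicate_counts(triples)

# Two triple patterns of a basic graph pattern, written as plain tuples.
query = [("?x", "knows", "?y"), ("?x", "worksAt", "?z")]
plan = order_patterns(query, counts)
```

In this sketch the `worksAt` pattern is moved ahead of the `knows` pattern because it matches fewer edges in the toy data. A distributed system would compute such statistics over the partitioned graph rather than over an in-memory list.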
