PigSPARQL: mapping SPARQL to Pig Latin

In this paper we investigate the scalable processing of complex SPARQL queries on very large RDF datasets. As the underlying platform we use Apache Hadoop, an open-source implementation of Google's MapReduce for massively parallel computation on a computer cluster. We introduce PigSPARQL, a system that processes complex SPARQL queries on a MapReduce cluster. To this end, SPARQL queries are translated into Pig Latin, a data analysis language developed by Yahoo! Research. Pig Latin programs are executed as a series of MapReduce jobs on a Hadoop cluster. We evaluate PigSPARQL's query processing with SP2Bench, a SPARQL-specific performance benchmark, and demonstrate that PigSPARQL enables scalable execution of SPARQL queries on Hadoop without any additional programming effort.
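
To make the translation step more concrete, the following is a minimal sketch, not PigSPARQL's actual translation, of how two triple patterns that share a variable might be turned into Pig Latin text: one FILTER and one FOREACH per pattern, followed by a JOIN on the shared variable. The function names, the relation aliases (`triples`, `f0`, `t0`, `bgp`), and the tab-separated input file `triples.tsv` are assumptions made for illustration. The sketch also assumes that each pattern contains at least one variable and that no variable is repeated within a pattern.

```python
def is_var(term):
    """SPARQL variables start with '?'."""
    return term.startswith("?")


def translate_pattern(index, pattern):
    """Translate one (s, p, o) triple pattern into Pig Latin statements.

    Bound terms become FILTER conditions; variables become projected,
    renamed columns. Returns (alias, variable names, statements).
    """
    alias = f"t{index}"
    pairs = list(zip(("s", "p", "o"), pattern))
    conditions = [f"{field} == '{term}'" for field, term in pairs if not is_var(term)]
    columns = [f"{field} AS {term[1:]}" for field, term in pairs if is_var(term)]
    variables = [term[1:] for _, term in pairs if is_var(term)]

    statements = []
    source = "triples"
    if conditions:
        source = f"f{index}"
        statements.append(f"{source} = FILTER triples BY {' AND '.join(conditions)};")
    statements.append(f"{alias} = FOREACH {source} GENERATE {', '.join(columns)};")
    return alias, variables, statements


def translate_bgp(first, second):
    """Translate two triple patterns joined on their shared variable(s)."""
    a0, v0, s0 = translate_pattern(0, first)
    a1, v1, s1 = translate_pattern(1, second)
    shared = [v for v in v0 if v in v1]
    if not shared:
        raise ValueError("patterns share no variable; a cross product would be needed")
    key = shared[0] if len(shared) == 1 else "(" + ", ".join(shared) + ")"
    # Hypothetical input: one triple per line, tab-separated (Pig's default loader).
    load = "triples = LOAD 'triples.tsv' AS (s:chararray, p:chararray, o:chararray);"
    join = f"bgp = JOIN {a0} BY {key}, {a1} BY {key};"
    return [load] + s0 + s1 + [join]


if __name__ == "__main__":
    program = translate_bgp(
        ("?article", "rdf:type", "bench:Article"),
        ("?article", "dc:title", "?title"),
    )
    print("\n".join(program))
```

Run as a script, this prints a six-statement Pig Latin program for the pattern pair `?article rdf:type bench:Article . ?article dc:title ?title`. Pig would then compile such a program into MapReduce jobs, with the JOIN typically requiring a reduce phase.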
