Job-Optimized Map-Side Join Processing Using MapReduce and HBase with Abstract RDF Data

The amount of RDF data published on the Web is growing rapidly, and MapReduce-based distributed frameworks have become a common choice for processing SPARQL queries against it. However, existing MapReduce-based query processing systems have not kept up with the growth of semantically annotated data, so SPARQL query processing remains non-interactive. The principal reason is that join operations in a MapReduce framework produce intermediate results so large that network bandwidth and disk I/O become the bottleneck. In this paper, we present an efficient SPARQL processing system that uses MapReduce and HBase. The system runs a job-optimized query plan over our proposed abstract RDF data to reduce the amount of intermediate data, which results in faster query processing. We also present an efficient algorithm that applies Map-side joins and uses the abstract RDF data to filter out unneeded RDF data. Experimental results show that the proposed approach outperforms previous approaches on queries with large input sets.
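
As a rough illustration of the two ideas named above (filtering before the join, then joining on the Map side without a shuffle), the following Python sketch joins two triple lists that are already sorted by subject. It is not the system described in this paper: the `candidate_subjects` set is a hypothetical stand-in for the abstract RDF data, whose actual structure is not given here, and the function names and sample triples are assumptions made for the example.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def prefilter(triples: List[Triple], candidate_subjects: Set[str]) -> List[Triple]:
    """Drop triples whose subject cannot take part in the join.

    candidate_subjects is a hypothetical summary standing in for the
    abstract RDF data; input order is preserved.
    """
    return [t for t in triples if t[0] in candidate_subjects]


def map_side_merge_join(left: List[Triple], right: List[Triple]) -> List[Tuple[str, str, str]]:
    """Merge-join two inputs that are both sorted by subject.

    Because both sides are pre-sorted on the join key, one pass suffices
    and no shuffle/reduce phase is required.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        ls, rs = left[i][0], right[j][0]
        if ls < rs:
            i += 1
        elif ls > rs:
            j += 1
        else:
            # Find the run of equal keys on each side, then emit all pairs.
            i_end = i
            while i_end < len(left) and left[i_end][0] == ls:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == ls:
                j_end += 1
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    out.append((ls, l[2], r[2]))
            i, j = i_end, j_end
    return out


# Made-up sample data, sorted by subject.
types = [("s1", "type", "Student"), ("s2", "type", "Student"), ("s3", "type", "Professor")]
members = [("s1", "memberOf", "d1"), ("s3", "memberOf", "d2"), ("s4", "memberOf", "d1")]
candidate_subjects = {"s1", "s3"}  # hypothetical summary of joinable subjects

filtered_types = prefilter(types, candidate_subjects)
filtered_members = prefilter(members, candidate_subjects)
joined = map_side_merge_join(filtered_types, filtered_members)
```

In this toy setting the filter removes `s2` and `s4` before the join runs, which is the kind of reduction in intermediate data the abstract refers to; in a real MapReduce job the saving would come from not shipping those rows across the network.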
