Paper Information - Map-Side Merge Joins for Scalable SPARQL BGP Processing

In recent times, it has been widely recognized that, due to their inherent scalability, frameworks based on MapReduce are indispensable for so-called "Big Data" applications. However, for Semantic Web applications using SPARQL, there is still a demand for sophisticated MapReduce join techniques for processing basic graph patterns (BGPs), which are at the core of SPARQL. Renowned for their stable and efficient performance, sort-merge joins have become widely used in DBMSs. In this paper, we demonstrate the adaptation of merge joins for SPARQL BGP processing with MapReduce. Our technique supports both n-way joins and sequences of join operations by applying merge joins within the map phase of MapReduce, while the reduce phase is used only to fulfill the preconditions of a subsequent join iteration. Our experiments with the LUBM benchmark show an average performance benefit between 15% and 48% compared to other MapReduce-based approaches, while at the same time scaling linearly with the RDF dataset size.
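
To illustrate the core idea behind a merge join, here is a minimal single-machine sketch in Python. It is not the paper's implementation and involves no MapReduce machinery. It only shows how two inputs that are already sorted on the join variable can be joined in one linear pass, which is the kind of precondition the abstract says the reduce phase establishes for the next join iteration. The function name `merge_join`, the tuple layout, and the sample bindings are all hypothetical.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs, both sorted ascending by key.

    Returns a list of (key, left_value, right_value) tuples.
    Assumption: inputs are pre-sorted on the join key; this sketch
    does not sort or verify the ordering itself.
    """
    results = []
    i, j = 0, 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1          # left key has no partner yet, advance left
        elif lkey > rkey:
            j += 1          # right key has no partner yet, advance right
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    results.append((lkey, lval, rval))
            i, j = i_end, j_end
    return results


# Hypothetical bindings for two triple patterns sharing the variable ?x,
# e.g. (?x advisor ?p) and (?x takesCourse ?c), each sorted by ?x.
advisor = [("s1", "prof1"), ("s2", "prof2"), ("s4", "prof1")]
takes_course = [("s1", "c1"), ("s1", "c2"), ("s3", "c9"), ("s4", "c3")]

joined = merge_join(advisor, takes_course)
for row in joined:
    print(row)
```

Because each input is scanned once, the cost is linear in the input sizes plus the size of the output. In a map-side setting, each mapper would presumably apply such a pass to matching, equally partitioned and sorted splits of the inputs; that mapping onto MapReduce is an assumption here and is not taken from the abstract.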
