Efficient Subgraph Matching on Large RDF Graphs Using MapReduce

With the popularity of knowledge graphs growing rapidly, large amounts of RDF graphs have been released, which raises the need for addressing the challenge of distributed subgraph matching queries. In this paper, we propose an efficient distributed method to answer subgraph matching queries on big RDF graphs using MapReduce. In our method, query graphs are decomposed into a set of stars that utilize the semantic and structural information embedded RDF graphs as heuristics. Two optimization techniques are proposed to further improve the efficiency of our algorithms. One algorithm, called RDF property filtering, filters out invalid input data to reduce intermediate results; the other is to improve the query performance by postponing the Cartesian product operations. The extensive experiments on both synthetic and real-world datasets show that our method outperforms the close competitors S2X and SHARD by an order of magnitude on average.

[1]  Bhavani M. Thuraisingham,et al.  Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing , 2011, IEEE Transactions on Knowledge and Data Engineering.

[2]  Sherif Sakr,et al.  DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication , 2015, Proc. VLDB Endow..

[3]  Daniel J. Abadi,et al.  Scalable Semantic Web Data Management Using Vertical Partitioning , 2007, VLDB.

[4]  Jeff Heflin,et al.  LUBM: A benchmark for OWL knowledge base systems , 2005, J. Web Semant..

[5]  Georg Lausen,et al.  S2RDF: RDF Querying with SPARQL on Spark , 2015, Proc. VLDB Endow..

[6]  Lei Zou,et al.  gStore: a graph-based SPARQL query engine , 2014, The VLDB Journal.

[7]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[8]  Jürgen Umbrich,et al.  YARS2: A Federated Repository for Querying Graph Structured Data from the Web , 2007, ISWC/ASWC.

[9]  Lijun Chang,et al.  Scalable Subgraph Enumeration in MapReduce , 2015, Proc. VLDB Endow..

[10]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[11]  Tim Berners-Lee,et al.  Linked Data - The Story So Far , 2009, Int. J. Semantic Web Inf. Syst..

[12]  Lei Chen,et al.  Distance-Aware Selective Online Query Processing Over Large Distributed Graphs , 2016, Data Science and Engineering.

[13]  M. Tamer Özsu,et al.  Diversified Stress Testing of RDF Data Management Systems , 2014, SEMWEB.

[14]  Jianzhong Li,et al.  Efficient Subgraph Matching on Billion Node Graphs , 2012, Proc. VLDB Endow..

[15]  Georg Lausen,et al.  S2X: Graph-Parallel Querying of RDF with GraphX , 2015, Big-O/DMAH@VLDB.

[16]  Martin E. Dyer,et al.  The complexity of counting graph homomorphisms , 2000, Random Struct. Algorithms.

[17]  Yinghui Wu,et al.  Fast top-k search in knowledge graphs , 2016, 2016 IEEE 32nd International Conference on Data Engineering (ICDE).

[18]  Peng Peng,et al.  Processing SPARQL queries over distributed RDF graphs , 2014, The VLDB Journal.

[19]  Amit P. Sheth,et al.  Semantic Services, Interoperability and Web Applications - Emerging Concepts , 2011, Semantic Services, Interoperability and Web Applications.

[20]  James A. Hendler,et al.  The Semantic Web" in Scientific American , 2001 .

[21]  Patrick Valduriez,et al.  Join indices , 1987, TODS.

[22]  Marcelo Arenas,et al.  Semantics and Complexity of SPARQL , 2006, International Semantic Web Conference.

[23]  G Stix,et al.  The mice that warred. , 2001, Scientific American.

[24]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[25]  Richard E. Schantz,et al.  High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store , 2010, PSI EtA '10.

[26]  Martin Theobald,et al.  TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing , 2014, SIGMOD Conference.

[27]  Reynold Xin,et al.  GraphX: Graph Processing in a Distributed Dataflow Framework , 2014, OSDI.

[28]  Georg Lausen,et al.  Sempala: Interactive SPARQL Query Processing on Hadoop , 2014, SEMWEB.

[29]  Haixun Wang,et al.  A Distributed Graph Engine for Web Scale RDF Data , 2013, Proc. VLDB Endow..

[30]  Orri Erling,et al.  Virtuoso: RDF Support in a Native RDBMS , 2009, Semantic Web Information Management.