Adaptive Partitioning for Very Large RDF Data

Distributed RDF systems partition data across multiple compute nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation, while others apply heuristics that aim to minimize inter-node communication during query evaluation. Such heuristics require an expensive data preprocessing phase, leading to high startup costs for very large RDF knowledge bases. A priori knowledge of the query workload has also been used to create partitions; these partitions, however, are static and do not adapt to workload changes, so inter-node communication cannot be consistently avoided for queries that are not favored by the initial data partitioning. In this paper, we propose AdHash, a distributed RDF system that addresses the shortcomings of previous work. First, AdHash applies lightweight partitioning to the initial data, distributing triples by hashing on their subjects; this keeps its startup overhead low. At the same time, the locality-aware query optimizer of AdHash takes full advantage of the partitioning to (i) support the fully parallel processing of join patterns on subjects and (ii) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdHash monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdHash implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems come online, and (iii) gracefully adapts to the query load, evaluating queries on billion-scale RDF data in sub-second time.
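
To illustrate why hashing triples on their subjects allows subject-subject joins to run without communication, the following is a minimal Python sketch. It is not AdHash's implementation: the function names (`worker_for`, `partition_by_subject`, `local_subject_join`), the use of CRC32 as the hash function, and the tuple representation of triples are all assumptions made for this example.

```python
import zlib
from collections import defaultdict


def worker_for(term, num_workers):
    # Deterministic hash (CRC32 is an assumption; any stable hash would do).
    return zlib.crc32(term.encode("utf-8")) % num_workers


def partition_by_subject(triples, num_workers):
    # Assign each (subject, predicate, object) triple to a worker by subject.
    parts = defaultdict(list)
    for s, p, o in triples:
        parts[worker_for(s, num_workers)].append((s, p, o))
    return parts


def local_subject_join(part, p1, p2):
    # Evaluate the pattern  ?s p1 ?x . ?s p2 ?y  on a single partition.
    # All triples of a subject are co-located, so no data leaves the worker.
    objects_by_subject = defaultdict(list)
    for s, p, o in part:
        if p == p1:
            objects_by_subject[s].append(o)
    return [
        (s, x, o)
        for s, p, o in part
        if p == p2
        for x in objects_by_subject.get(s, ())
    ]


triples = [
    ("alice", "worksFor", "acme"),
    ("alice", "livesIn", "paris"),
    ("bob", "worksFor", "acme"),
    ("bob", "livesIn", "rome"),
    ("carol", "livesIn", "oslo"),
]
parts = partition_by_subject(triples, num_workers=3)

# Each worker joins its own partition; the answer is the union of local results.
results = sorted(
    row
    for part in parts.values()
    for row in local_subject_join(part, "worksFor", "livesIn")
)
```

Because every triple with a given subject lands on the same worker, the union of the per-worker results equals the result of the join over the whole dataset. Joins on other positions (for example, object-subject) do not have this property, which is where hash distribution of intermediate results, or the adaptive redistribution described above, would come into play.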
