A data distribution model for RDF

The ever-increasing amount of available RDF data requires that data be partitioned across multiple servers. Research on scaling RDF query processing through suitable data distribution methods has made some progress. In general, existing methods work well for queries matching simple triple patterns, but they are inefficient for queries involving more complex patterns. In this paper, we present an RDF data distribution method that overcomes the shortcomings of current approaches in order to scale RDF storage in terms of both data volume and query processing. We apply a method that identifies frequent patterns accessed by queries in order to keep related data in the same partition. We perform this reasoning on a summarized view of the data to avoid exhaustive analysis of large datasets. As a result, partitioning templates are obtained from data items in an RDF structure. In addition, we provide an approach for dynamic data insertion that works even if new data do not conform to the original RDF structure. Rather than repartitioning, we use an overflow repository to store data that may not follow the original schema. Our study shows that our method scales well and effectively improves overall performance by decreasing the amount of message passing among servers, compared to alternative data distribution approaches for RDF.
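To make the idea of partitioning templates and an overflow repository more concrete, the following is a minimal sketch in Python, not the method described in the paper. It assumes that a template is simply a set of predicates associated with a class, that triples matching a template are co-located by hashing their subject, and that all other triples are sent to overflow. The names `TEMPLATES`, `assign_partitions`, and `stable_hash`, as well as the sample classes and predicates, are hypothetical.

```python
import zlib
from collections import defaultdict

# Hypothetical partitioning templates (an assumption for illustration):
# each class maps to the predicates stored together with its instances.
TEMPLATES = {
    "Product": {"label", "producer", "price"},
    "Review": {"reviewFor", "rating"},
}


def stable_hash(value):
    """Deterministic hash, so placement does not change between runs."""
    return zlib.crc32(value.encode("utf-8"))


def assign_partitions(triples, types, templates, num_servers):
    """Place each (subject, predicate, object) triple.

    A triple is co-located with the other triples of its subject when the
    subject's class has a template covering the predicate. Any other triple
    goes to the overflow repository instead of triggering a repartitioning.
    """
    servers = defaultdict(list)
    overflow = []
    for s, p, o in triples:
        cls = types.get(s)
        if cls in templates and p in templates[cls]:
            servers[stable_hash(s) % num_servers].append((s, p, o))
        else:
            overflow.append((s, p, o))
    return servers, overflow


# Usage: "color" is not in the Product template and "x9" has no known class,
# so both of those triples end up in the overflow repository.
triples = [
    ("p1", "label", "Widget"),
    ("p1", "price", "9.99"),
    ("r1", "rating", "5"),
    ("p1", "color", "red"),
    ("x9", "label", "?"),
]
types = {"p1": "Product", "r1": "Review"}
servers, overflow = assign_partitions(triples, types, TEMPLATES, 3)
```

In this sketch, a query touching only template predicates of one subject can be answered by a single server, which is the kind of locality that reduces message passing; how templates are actually derived from the workload and the data summary is outside its scope.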
