Building self-clustering RDF databases using Tunable-LSH

The Resource Description Framework (RDF) is a W3C standard for representing graph-structured data, and SPARQL is the standard query language for RDF. Recent advances in information extraction, linked data management and the Semantic Web have led to a rapid increase in both the volume and the variety of publicly available RDF data. As businesses start to capitalize on RDF data, RDF data management systems are being exposed to workloads that are far more diverse and dynamic than those they were designed to handle. Consequently, there is a growing need for workload-adaptive and self-tuning RDF data management systems. To realize this objective, we introduce a fast and efficient method for dynamically clustering records in an RDF data management system. Specifically, we assume nothing about the workload upfront; instead, as SPARQL queries are executed, we keep track of the records that are co-accessed by the queries in the workload and physically cluster them. To decide dynamically and in constant time where a record should be placed in the storage system, we develop a new locality-sensitive hashing (LSH) scheme, Tunable-LSH. Using Tunable-LSH, records that are co-accessed across similar sets of queries can be hashed to the same or nearby physical pages in the storage system. What sets Tunable-LSH apart from existing LSH schemes is that it can auto-tune to achieve this clustering objective with high accuracy even when the workload changes. Experimental evaluation of Tunable-LSH in an RDF data management system, as well as in a standalone hashtable, shows end-to-end performance gains over existing solutions.
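The idea of hashing co-accessed records to the same page can be illustrated with a minimal sketch. It is not Tunable-LSH itself: it uses plain bit-sampling LSH over Hamming distance, with no auto-tuning, and it assumes each record is described by a 0/1 vector recording which queries accessed it. All function names and the toy workload are hypothetical.

```python
import random
from collections import defaultdict


def access_vectors(workload, record_ids):
    """Map each record to a 0/1 vector: bit i is 1 if query i accessed it.

    `workload` is a list of sets of record ids, one set per executed query
    (an assumed representation, chosen only for this illustration).
    """
    return {r: [1 if r in q else 0 for q in workload] for r in record_ids}


def make_bit_sampling_hash(num_queries, num_bits, seed=0):
    """Generic bit-sampling LSH: keep `num_bits` randomly chosen positions.

    Vectors that are close in Hamming distance agree on most positions,
    so they are likely to produce the same sampled key.
    """
    rng = random.Random(seed)
    positions = rng.sample(range(num_queries), num_bits)

    def h(vector):
        return tuple(vector[p] for p in positions)

    return h


def cluster(workload, record_ids, num_bits, seed=0):
    """Group records into 'pages' keyed by their LSH value."""
    h = make_bit_sampling_hash(len(workload), num_bits, seed)
    pages = defaultdict(list)
    for record, vector in access_vectors(workload, record_ids).items():
        pages[h(vector)].append(record)
    return dict(pages)


if __name__ == "__main__":
    # Toy workload: records "a" and "b" are always accessed together.
    workload = [{"a", "b"}, {"a", "b"}, {"c"}, {"c", "d"}]
    for key, records in cluster(workload, ["a", "b", "c", "d"], num_bits=2).items():
        print(key, records)
```

Records with identical access patterns always share a page in this sketch, while records with similar patterns share one with a probability that depends on `num_bits`. Choosing and adapting such parameters as the workload changes is the part a fixed scheme like this one does not address.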
