Compressed vertical partitioning for efficient RDF management

The Web of Data has been gaining momentum in recent years. This has led to the publication of more and more semi-structured datasets following, in many cases, the RDF (Resource Description Framework) data model, which is based on atomic triple units of subject, predicate, and object. Although it is a very simple model, specific compression methods become necessary because datasets are increasingly large and various scalability issues arise around their organization and storage. This requirement is even more restrictive in RDF stores because efficient SPARQL query resolution on the compressed RDF datasets is also required. This article introduces a novel RDF indexing technique that supports efficient SPARQL query resolution in compressed space. Our technique, called $k^2$-triples, uses the predicate to vertically partition the dataset into disjoint subsets of (subject, object) pairs, one per predicate. These subsets are represented as binary matrices of subjects $\times$ objects in which a 1-bit means that the corresponding triple exists in the dataset. This model results in very sparse matrices, which are efficiently compressed using $k^2$-trees. We enhance this model with two compact indexes listing the predicates related to each different subject and object in the dataset, in order to address the specific weaknesses of vertically partitioned representations. The resulting technique not only achieves by far the most compressed representations, but also achieves the best overall performance for RDF retrieval in our experimental setup. Our approach uses up to 10 times less space than a state-of-the-art baseline and outperforms it in query time by several orders of magnitude on the most basic query patterns. In addition, we optimize traditional join algorithms on $k^2$-triples and define a novel one that leverages its specific features. Our experimental results show that our technique also outperforms traditional vertical partitioning for join resolution, reporting the best numbers for joins in which the non-joined nodes are provided, and being competitive in most of the other cases.
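
As an illustration only, the ideas summarized above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: subjects, predicates, and objects are assumed to be already mapped to small integer IDs, bitmaps are plain Python lists rather than compact bit vectors with constant-time rank, the two predicate indexes are ordinary dictionaries of sets instead of compressed structures, and all function names (`vertical_partition`, `build_k2tree`, `has_cell`, `build_indexes`, `predicates_linking`) are hypothetical.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Split (subject, predicate, object) ID triples into one set of
    (subject, object) pairs per predicate."""
    parts = defaultdict(set)
    for s, p, o in triples:
        parts[p].add((s, o))
    return dict(parts)


def build_k2tree(cells, size, k=2):
    """Encode the 1-cells of a size x size binary matrix as a k2-tree.

    Returns (T, L): T holds the bits of the internal levels in level order,
    L holds the bits of the last level. `size` must be a power of k, >= k.
    """
    T, L = [], []
    level = [(0, 0, size, list(cells))]  # (row0, col0, side, cells inside)
    while level:
        next_level = []
        for r0, c0, side, pts in level:
            sub = side // k
            for i in range(k):
                for j in range(k):
                    rr, cc = r0 + i * sub, c0 + j * sub
                    inside = [(r, c) for r, c in pts
                              if rr <= r < rr + sub and cc <= c < cc + sub]
                    bit = 1 if inside else 0
                    if sub == 1:
                        L.append(bit)      # last level: individual cells
                    else:
                        T.append(bit)      # internal level: submatrix flag
                        if bit:
                            next_level.append((rr, cc, sub, inside))
        level = next_level
    return T, L


def has_cell(T, L, size, row, col, k=2):
    """Check whether cell (row, col) is set by descending the k2-tree."""
    bits = T + L
    pos, side = 0, size                    # pos: first child of current node
    while True:
        side //= k
        idx = pos + (row // side) * k + (col // side)
        if bits[idx] == 0:
            return False                   # empty submatrix: stop early
        if side == 1:
            return True                    # reached a 1-bit in the last level
        row, col = row % side, col % side
        # Children of the node at idx start at rank1(T, idx) * k^2.
        # A linear-time rank is used here purely for simplicity.
        pos = sum(T[:idx + 1]) * k * k


def build_indexes(triples):
    """Map each subject (SP) and each object (OP) to its related predicates."""
    sp, op = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        sp[s].add(p)
        op[o].add(p)
    return sp, op


def predicates_linking(s, o, trees, sp, op, size):
    """Resolve the pattern (s, ?p, o), probing only the predicates that
    appear with both the subject and the object."""
    candidates = sp.get(s, set()) & op.get(o, set())
    return sorted(p for p in candidates if has_cell(*trees[p], size, s, o))


# Toy dataset: IDs 0..3 for subjects and objects, predicates 0 and 1.
SIZE = 4
triples = [(0, 0, 1), (2, 0, 3), (1, 1, 1), (3, 1, 0), (0, 1, 1)]
trees = {p: build_k2tree(pairs, SIZE)
         for p, pairs in vertical_partition(triples).items()}
sp, op = build_indexes(triples)

print(has_cell(*trees[0], SIZE, 2, 3))                # triple (2, 0, 3)
print(predicates_linking(0, 1, trees, sp, op, SIZE))  # pattern (0, ?p, 1)
```

In this sketch, a fully bound pattern (s, p, o) reduces to a single `has_cell` lookup in the tree of predicate p, while a pattern with an unbound predicate would otherwise require probing every tree. Intersecting the predicate lists of the subject and the object narrows that set first, which is the weakness of vertical partitioning that the two additional indexes are meant to address.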