RDF-TR: Exploiting structural redundancies to boost RDF compression

Abstract: The number and volume of semantic datasets have grown impressively over the last decade, making compression an essential tool for RDF preservation, sharing and management. In contrast to universal compressors, RDF compression techniques are able to detect and exploit specific forms of redundancy in RDF data. Thus, state-of-the-art RDF compressors excel at exploiting syntactic and semantic redundancies, i.e., repetitions in the serialization format and information that can be inferred implicitly. However, little attention has been paid to the existence of structural patterns within the RDF dataset, i.e., structural redundancy. In this paper, we analyze structural regularities in real-world datasets, and show three schema-based sources of redundancies that underpin the schema-relaxed nature of RDF. Then, we propose RDF-Tr (RDF Triples Reorganizer), a preprocessing technique that discovers and removes this kind of redundancy before the RDF dataset is effectively compressed. In particular, RDF-Tr groups subjects that are described by the same predicates, and locally re-codes the objects related to these predicates. Finally, we integrate RDF-Tr with two RDF compressors, HDT and k²-triples. Our experiments show that using RDF-Tr with these compressors improves their original effectiveness by up to 2.3 times, outperforming the most prominent state-of-the-art techniques.
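The two operations named in the abstract (grouping subjects that share the same predicates, and re-coding objects locally per predicate) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `group_by_predicate_family` and `local_object_ids`, the toy triples, and the choice of first-appearance order for local IDs are all assumptions made for illustration only.

```python
from collections import defaultdict

def group_by_predicate_family(triples):
    """Group subjects by the exact set of predicates that describe them.

    Hypothetical helper: the paper's actual grouping procedure may differ.
    """
    preds_of = defaultdict(set)
    for s, p, _ in triples:
        preds_of[s].add(p)
    families = defaultdict(list)
    for s, preds in preds_of.items():
        families[frozenset(preds)].append(s)
    # Sort subjects so the result is deterministic.
    return {family: sorted(subjects) for family, subjects in families.items()}

def local_object_ids(triples):
    """Give each object a small ID that is local to its predicate.

    Assumption: IDs are assigned in order of first appearance, starting at 1.
    """
    local = defaultdict(dict)
    for _, p, o in triples:
        ids = local[p]
        if o not in ids:
            ids[o] = len(ids) + 1
    return dict(local)

# Toy dataset (invented for this example).
triples = [
    ("alice", "name", '"Alice"'),
    ("alice", "knows", "bob"),
    ("bob", "name", '"Bob"'),
    ("bob", "knows", "alice"),
    ("paper1", "title", '"RDF-Tr"'),
]

families = group_by_predicate_family(triples)
object_ids = local_object_ids(triples)
```

In this toy dataset, `alice` and `bob` fall into the same family because both are described by exactly `name` and `knows`, so their shared predicate list only needs to be stored once. Local IDs stay small because each predicate numbers only the objects it actually uses, rather than drawing from a single global object dictionary.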
