A Compact RDF Store Using Suffix Arrays

RDF has become a standard format to describe resources in the Semantic Web and other scenarios. RDF data is composed of triples (subject, predicate, object), referring respectively to a resource, a property of that resource, and the value of that property. Compact storage schemes allow larger datasets to fit in main memory for faster processing. On the other hand, supporting efficient SPARQL queries on RDF datasets requires index data structures to accompany the data, which hampers compactness. As has been done for text collections, we introduce a self-index for RDF data, which combines the data and its index in a single representation that takes less space than the raw triples and efficiently supports basic SPARQL queries. Our storage format, RDFCSA, builds on compressed suffix arrays. Although more compact representations of RDF data exist, RDFCSA uses about half the space of the raw data, which it replaces, and displays much more robust and predictable query times, around 1-2 microseconds per retrieved triple. RDFCSA is 3 orders of magnitude faster than representations like MonetDB or RDF-3X, while using the same space as the former and 6 times less space than the latter. It is also faster than the more compact representations on most queries, in some cases by 2 orders of magnitude.
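As a rough illustration of why sorting triples in cyclic order is enough to answer basic triple patterns, the following toy sketch keeps three sorted copies of the integer-encoded triples (the rotations SPO, POS, and OSP) and answers any pattern by binary search on the rotation where the bound components form a prefix. It is not the RDFCSA structure: it is uncompressed, it stores the data three times instead of once, and the names `build_index` and `match` are hypothetical. It assumes subjects, predicates, and objects have already been mapped to integer ids.

```python
from bisect import bisect_left

# Each rotation lists which component of (s, p, o) comes first, second, third.
ROTATIONS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}


def build_index(triples):
    """Return one sorted list of rotated triples per cyclic rotation."""
    unique = set(triples)
    return {
        name: sorted(tuple(t[i] for i in order) for t in unique)
        for name, order in ROTATIONS.items()
    }


def match(index, s=None, p=None, o=None):
    """Return all triples matching the pattern; None marks an unbound component."""
    pattern = (s, p, o)
    for name, order in ROTATIONS.items():
        rotated = [pattern[i] for i in order]
        # Count the bound components at the front of this rotation.
        k = 0
        while k < 3 and rotated[k] is not None:
            k += 1
        # Usable only if everything after the bound prefix is unbound.
        if any(x is not None for x in rotated[k:]):
            continue
        rows = index[name]
        if k == 0:
            lo, hi = 0, len(rows)
        else:
            prefix = tuple(rotated[:k])
            # A shorter tuple sorts before any longer tuple it prefixes.
            lo = bisect_left(rows, prefix)
            # Integer ids: the next prefix is obtained by adding 1 to the last id.
            hi = bisect_left(rows, prefix[:-1] + (prefix[-1] + 1,))
        result = []
        for row in rows[lo:hi]:
            triple = [None, None, None]
            for pos, i in enumerate(order):
                triple[i] = row[pos]  # undo the rotation
            result.append(tuple(triple))
        return result
    return []  # unreachable: every pattern is a prefix of some rotation


triples = [(1, 10, 2), (1, 11, 3), (2, 10, 3), (3, 10, 1)]
index = build_index(triples)
subject_1 = match(index, s=1)           # pattern (1, ?, ?)
pred_obj = match(index, p=10, o=3)      # pattern (?, 10, 3)
subj_obj = match(index, s=1, o=3)       # pattern (1, ?, 3)
```

Every one of the eight triple patterns has its bound components as a prefix of at least one of the three rotations, so each query reduces to locating a contiguous range in a sorted array. A self-index aims to offer the same range-based access from a single compressed representation.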
