HDT-MR: A Scalable Solution for RDF Compression with HDT and MapReduce

HDT is a binary RDF serialization that aims to minimize the space overheads of traditional RDF formats while providing retrieval features in compressed space. Several HDT-based applications, such as the recent Linked Data Fragments proposal, leverage these features for diverse publication, interchange and consumption purposes. However, scalability issues emerge in HDT construction because the whole RDF dataset must be processed in a single memory-consuming task. This hinders the evolution of novel applications and techniques at Web scale. This paper introduces HDT-MR, a MapReduce-based technique to process huge RDF datasets and build the HDT serialization. HDT-MR runs in time linear in the dataset size and has proven able to serialize datasets of up to several billion triples, preserving HDT compression and retrieval features.
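
As a rough illustration of how dictionary construction can be phrased in MapReduce terms, the following single-process Python sketch maps each triple to (term, role) pairs, reduces them to the set of roles per term, and then assigns integer IDs. It is not the HDT-MR algorithm itself: the function names, the in-memory grouping, and the simplified ID layout (terms that are both subject and object numbered first, predicates numbered separately) are assumptions made for this example only.

```python
from collections import defaultdict


def map_roles(triple):
    """Map step: emit one (term, role) pair per position of a triple."""
    s, p, o = triple
    return [(s, "S"), (p, "P"), (o, "O")]


def reduce_roles(pairs):
    """Reduce step: collect the set of roles observed for each term."""
    roles = defaultdict(set)
    for term, role in pairs:
        roles[term].add(role)
    return roles


def build_dictionary(triples):
    """Assign integer IDs per section (simplified, illustrative layout)."""
    pairs = [pair for t in triples for pair in map_roles(t)]
    roles = reduce_roles(pairs)
    # Terms used as both subject and object go to a shared section.
    shared = sorted(t for t, r in roles.items() if {"S", "O"} <= r)
    subjects = sorted(t for t, r in roles.items() if "S" in r and "O" not in r)
    objects = sorted(t for t, r in roles.items() if "O" in r and "S" not in r)
    predicates = sorted(t for t, r in roles.items() if "P" in r)
    # Shared terms receive the lowest IDs in both subject and object spaces.
    subject_id = {t: i + 1 for i, t in enumerate(shared + subjects)}
    object_id = {t: i + 1 for i, t in enumerate(shared + objects)}
    predicate_id = {t: i + 1 for i, t in enumerate(predicates)}
    return subject_id, predicate_id, object_id


def encode_triples(triples):
    """Replace terms by IDs and sort the ID triples by subject, predicate, object."""
    subject_id, predicate_id, object_id = build_dictionary(triples)
    return sorted(
        (subject_id[s], predicate_id[p], object_id[o]) for s, p, o in triples
    )


if __name__ == "__main__":
    example = [("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:c")]
    print(encode_triples(example))
```

In a real cluster setting the map and reduce steps would run as distributed jobs over partitions of the input, and the sorting would be handled by the framework rather than by an in-memory `sorted` call.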
