'Unbound-property' graph pattern queries enable flexible exploration of large RDF datasets whose relationships are not known in advance. Relational-style processing of such queries over normalized relations produces redundant intermediate results, because the values of the adjoining bound (fixed) properties are repeated for every match of the unbound property. On MapReduce-based systems, this redundancy increases disk I/O, network transfer costs, and the disk space required to process RDF query workloads. This work proposes packing and lazy unpacking strategies that minimize the redundancy in intermediate results while processing unbound-property queries. In addition to keeping the results compact, this work evaluates RDF queries using the Nested TripleGroup Data Model and Algebra (NTGA), which enables shorter MapReduce execution workflows. Experimental results demonstrate the benefit of this approach over RDF query processing on relational-style systems such as Apache Pig and Hive.
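The redundancy described above can be illustrated with a small sketch. It is not the NTGA implementation or its operators. The toy triples, the property names, and the functions `flat_results`, `packed_results`, and `unpack` are all hypothetical and chosen only for illustration. The query pairs one bound property (`label`) with one unbound property (`?s label ?o . ?s ?p ?x`). The flat form repeats the bound value in every row, whereas the packed form stores it once per subject and expands to rows only when asked.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, property, object); not taken from the paper.
TRIPLES = [
    ("prod1", "label", "Widget"),
    ("prod1", "price", "10"),
    ("prod1", "vendor", "v1"),
    ("prod1", "feature", "f1"),
    ("prod2", "label", "Gadget"),
    ("prod2", "price", "25"),
]


def flat_results(triples, bound_prop):
    """Relational-style evaluation of: ?s bound_prop ?o . ?s ?p ?x
    Each output row repeats the bound value ?o."""
    bound = [(s, o) for s, p, o in triples if p == bound_prop]
    return [(s, o, p, x)
            for s, o in bound
            for s2, p, x in triples
            if s2 == s]


def packed_results(triples, bound_prop):
    """Packed evaluation: one group per subject, with the bound values
    stored once next to the list of unbound (property, object) matches."""
    groups = defaultdict(lambda: {"bound": [], "unbound": []})
    for s, p, o in triples:
        if p == bound_prop:
            groups[s]["bound"].append(o)
        # The unbound property ?p matches every property, including bound_prop.
        groups[s]["unbound"].append((p, o))
    # Keep only subjects that actually match the bound property.
    return {s: g for s, g in groups.items() if g["bound"]}


def unpack(packed):
    """Lazy unpacking: yield flat rows only when a consumer iterates."""
    for s, g in packed.items():
        for o in g["bound"]:
            for p, x in g["unbound"]:
                yield (s, o, p, x)


flat = flat_results(TRIPLES, "label")
packed = packed_results(TRIPLES, "label")

# Count stored values in each representation.
flat_cells = sum(len(row) for row in flat)
packed_cells = sum(1 + len(g["bound"]) + 2 * len(g["unbound"])
                   for g in packed.values())
print(flat_cells, packed_cells)
```

On this toy input the flat form stores more values than the packed form, and the gap widens as more bound properties adjoin the unbound one. Because `unpack` is a generator, the expansion can be postponed until a later step actually needs flat rows.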