Large-scale bisimulation of RDF graphs

RDF datasets with billions of triples are no longer unusual and continue to grow constantly (e.g. LOD cloud) driven by the inherent flexibility of RDF that allows to represent very diverse datasets, ranging from highly structured to unstructured data. Because of their size, understanding and processing RDF graphs is often a difficult task and methods to reduce the size while keeping as much of its structural information become attractive. In this paper we study bisimulation as a means to reduce the size of RDF graphs according to structural equivalence. We study two bisimulation algorithms, one for sequential execution using SQL and one for distributed execution using MapReduce. We demonstrate that the MapReduce-based implementation scales linearly with the number of the RDF triples, allowing to compute the bisimulation of very large RDF graphs within a time which is by far not possible for the sequential version. Experiments based on synthetic benchmark data and real data (DBPedia) exhibit a reduction of more than 90% of the size of the RDF graph in terms of the number of nodes to the number of blocks in the resulting bisimulation partition.

[1]  Davide Sangiorgi,et al.  On the origins of bisimulation and coinduction , 2009, TOPL.

[2]  Jan Hidders,et al.  Bisimulation Reduction of Big Graphs on MapReduce , 2013, BNCOD.

[3]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[4]  Jens Lehmann,et al.  DBpedia - A crystallization point for the Web of Data , 2009, J. Web Semant..

[5]  Jeffrey F. Naughton,et al.  Covering indexes for branching path queries , 2002, SIGMOD '02.

[6]  Robin Milner,et al.  Communication and concurrency , 1989, PHI Series in computer science.

[7]  Scott A. Smolka,et al.  CCS expressions, finite state processes, and three problems of equivalence , 1983, PODC '83.

[8]  George H. L. Fletcher,et al.  Efficient external-memory bisimulation on DAGs , 2012, SIGMOD Conference.

[9]  Mariano P. Consens,et al.  ExpLOD: Summary-Based Exploration of Interlinking and RDF Usage in the Linked Open Data Cloud , 2010, ESWC.

[10]  Andrew Lim,et al.  D(k)-index: an adaptive structural summary for graph-structured data , 2003, SIGMOD '03.

[11]  Georg Lausen,et al.  SP2Bench: A SPARQL Performance Benchmark , 2008, Semantic Web Information Management.

[12]  Jeff Heflin,et al.  LUBM: A benchmark for OWL knowledge base systems , 2005, J. Web Semant..

[13]  Agostino Dovier,et al.  An efficient algorithm for computing bisimulation equivalence , 2004, Theor. Comput. Sci..

[14]  Serge Abiteboul,et al.  Extracting schema from semistructured data , 1998, SIGMOD '98.

[15]  Dan Suciu,et al.  Index Structures for Path Expressions , 1999, ICDT.

[16]  Simona Orzan,et al.  A distributed algorithm for strong bisimulation reduction of state spaces , 2004, International Journal on Software Tools for Technology Transfer.

[17]  Steffen Staab,et al.  SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data , 2012, J. Web Semant..

[18]  Octavian Udrea,et al.  Apples and oranges: a comparison of RDF benchmarks and real RDF datasets , 2011, SIGMOD '11.

[19]  Xin Wang,et al.  Query preserving graph compression , 2012, SIGMOD Conference.

[20]  Georg Lausen,et al.  SP^2Bench: A SPARQL Performance Benchmark , 2008, 2009 IEEE 25th International Conference on Data Engineering.

[21]  Anas Alzogbi,et al.  Similar Structures inside RDF-Graphs , 2013, LDOW.

[22]  Ehud Gudes,et al.  Exploiting local similarity for indexing paths in graph-structured data , 2002, Proceedings 18th International Conference on Data Engineering.

[23]  Sebastian Rudolph,et al.  Managing Structured and Semistructured RDF Data Using Structure Indexes , 2013, IEEE Transactions on Knowledge and Data Engineering.