External memory k-bisimulation reduction of big graphs

In this paper, we present what are, to our knowledge, the first I/O-efficient solutions for computing the k-bisimulation partition of a massive directed graph and for maintaining such a partition under updates to the underlying graph. Ubiquitous in the theory and application of graph data, bisimulation is a robust notion of node equivalence which, intuitively, groups together the nodes of a graph that share fundamental structural features. k-bisimulation is the standard variant of bisimulation in which the topological features of nodes are considered only within a local neighborhood of radius k > 0. The I/O cost of our partition construction algorithm is bounded by O(k · sort(|Et|) + k · scan(|Nt|) + sort(|Nt|)), while the I/O cost of our maintenance algorithms is bounded by O(k · sort(|Et|) + k · scan(|Nt|)). The space complexity bounds are O(|Nt| + |Et|) and O(k · |Nt| + k · |Et|), respectively. Here, |Et| and |Nt| are the numbers of disk pages occupied by the input graph's edge set and node set, respectively, and sort(n) and scan(n) are the costs of sorting and scanning, respectively, a file occupying n pages in external memory. Empirical analysis on a variety of massive real-world and synthetic graph datasets shows that our algorithms perform efficiently in practice, scaling gracefully as graphs grow in size.
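To make the notion of k-bisimulation concrete, the following is a minimal in-memory sketch in Python. It is not the external-memory algorithm of the paper: it assumes the whole graph fits in RAM, and it assumes the common signature-based formulation in which nodes carry labels, edges carry labels, and only outgoing edges are inspected. The function name `k_bisimulation_partition` and the input layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def k_bisimulation_partition(node_labels, edges, k):
    """Return {node: block_id} after k refinement rounds (in-memory sketch).

    node_labels: dict mapping node -> label (assumed input layout)
    edges: iterable of (source, edge_label, target) triples
    k: neighborhood radius, k >= 0
    """
    # Adjacency list of outgoing edges.
    out = defaultdict(list)
    for src, elabel, dst in edges:
        out[src].append((elabel, dst))

    # Round 0: two nodes are equivalent iff they carry the same label.
    ids = {}
    block0 = {u: ids.setdefault(lab, len(ids)) for u, lab in node_labels.items()}
    block = dict(block0)

    # Round j: a node's signature is its round-0 block together with the set
    # of (edge label, round-(j-1) block of the target) pairs over its out-edges.
    for _ in range(k):
        ids = {}
        new_block = {}
        for u in node_labels:
            sig = (block0[u], frozenset((l, block[v]) for l, v in out[u]))
            new_block[u] = ids.setdefault(sig, len(ids))
        block = new_block
    return block


# Small example: a -> b -> c and a2 -> b2, all nodes labeled "x".
labels = {n: "x" for n in ["a", "a2", "b", "b2", "c"]}
graph = [("a", "e", "b"), ("a2", "e", "b2"), ("b", "e", "c")]
p1 = k_bisimulation_partition(labels, graph, 1)
p2 = k_bisimulation_partition(labels, graph, 2)
print(p1["a"] == p1["a2"], p2["a"] == p2["a2"])
```

In this example, `a` and `a2` are indistinguishable within radius 1, since each has a single `e`-edge to a node labeled `x`, but they are separated at radius 2, because `b` has an outgoing edge and `b2` does not. Each round touches every node and every edge once, which loosely mirrors the factor of k in the I/O bounds stated above; the external-memory setting replaces the dictionary lookups here with sorts and scans over files.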
