Parallel subgraph listing in a large-scale graph

Subgraph listing is a fundamental operation in many graph and network analyses. The problem is computationally expensive and has been well studied for centralized processing. However, centralized solutions do not scale to large graphs. Several parallel approaches have recently been introduced to handle large graphs, but they still rely on expensive join operations and therefore cannot achieve high performance. In this paper, we design a novel parallel subgraph listing framework named PSgL. PSgL iteratively enumerates subgraph instances and solves the subgraph listing problem in a divide-and-conquer fashion. The framework relies entirely on graph traversal and avoids explicit join operations. To improve its performance, we propose several solutions that balance the workload and reduce the size of the intermediate results. Specifically, we prove that the problem of distributing partial subgraph instances for workload balance is NP-hard, and we carefully design a set of heuristic strategies for it. To further reduce the enormous intermediate results, we introduce three independent mechanisms: automorphism breaking of the pattern graph, initial pattern vertex selection based on a cost model, and a pruning method based on a light-weight index. We have implemented a prototype of PSgL and run comprehensive experiments with various subgraph listing operations on diverse large graphs. The experiments clearly demonstrate that PSgL is robust and achieves a performance gain of up to 90% over state-of-the-art solutions.
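As a rough illustration of traversal-based listing with automorphism breaking, the following minimal sketch enumerates triangles by repeatedly expanding partial instances through neighbor lookups instead of joining edge tables. It is not PSgL's implementation: it is sequential, the pattern is fixed to a triangle, and the names `build_adjacency` and `list_triangles` are hypothetical. Symmetry is broken by assuming integer vertex ids and requiring them to increase along each instance, so every triangle is reported once rather than six times.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def list_triangles(edges):
    """List each triangle once as a tuple of increasing vertex ids.

    Illustrative sketch only: partial instances are grown one vertex
    per iteration by traversing neighbors of the last mapped vertex.
    """
    adj = build_adjacency(edges)
    # Iteration 0: every data vertex is a partial instance of size 1.
    partial = [(v,) for v in sorted(adj)]
    # Two expansion rounds map the remaining two pattern vertices.
    for _ in range(2):
        expanded = []
        for inst in partial:
            last = inst[-1]
            for w in adj[last]:
                # Symmetry breaking: only extend with a larger id.
                # Pattern check: w must be adjacent to every mapped vertex.
                if w > last and all(w in adj[u] for u in inst):
                    expanded.append(inst + (w,))
        partial = expanded
    return sorted(partial)


if __name__ == "__main__":
    sample = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
    print(list_triangles(sample))
```

In a parallel setting, each round's list of partial instances would be what gets distributed across workers, which is where the workload balance and intermediate result size discussed in the abstract become the dominant concerns.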
