High-performance parallel frequent subgraph discovery

Discovering the frequent subgraphs of an input network is one of the most important tasks in mining and analyzing complex networks. The exact solution is to enumerate all subgraphs of size k and then count the frequency of each isomorphism class. However, this process is very time-consuming because the number of subgraphs grows exponentially with the size of the input network and with the subgraph size k. Moreover, no polynomial-time algorithm is known for deciding whether two subgraphs are isomorphic, which makes the problem harder still. Hence, the available solutions can only mine small input networks and small subgraph sizes. We propose Subdigger, a parallel and load-balanced solution that is faster and more efficient than the available solutions. Subdigger runs efficiently on current multicore and multiprocessor machines, and combines a fast heuristic with a high-performance concurrent data structure to significantly accelerate the detection and counting of isomorphic subgraphs. Subdigger can also handle large networks and large subgraph sizes by using external memory and external sorting. We performed several experiments on real-world input networks. Compared to the available solutions, Subdigger extracts frequent subgraphs much faster, and its performance scales almost linearly with additional processor cores. The experimental results show that Subdigger can be more than 100 times faster than other solutions on a 4-core Intel i7 machine. Beyond raw performance, Subdigger can process larger subgraphs by using external memory, whereas other tools crash because of memory limitations.
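
The exact baseline described in the abstract (enumerate every size-k subgraph, then count each isomorphism class) can be illustrated with a small brute-force sketch. This is not Subdigger's algorithm: it is sequential, it assumes an undirected simple graph and connected induced subgraphs, and it uses a naive canonical form that tries all k! relabelings in place of the paper's heuristic. The function names `is_connected`, `canonical_form`, and `count_subgraph_classes` are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations, permutations


def is_connected(nodes, adj):
    """Return True if the subgraph induced by `nodes` is connected."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u] & nodes:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def canonical_form(nodes, adj):
    """Naive canonical label (assumption, not the paper's heuristic):
    the lexicographically smallest sorted edge list over all k! relabelings."""
    nodes = list(nodes)
    best = None
    for perm in permutations(range(len(nodes))):
        relabel = dict(zip(nodes, perm))
        edges = tuple(sorted(
            (min(relabel[u], relabel[v]), max(relabel[u], relabel[v]))
            for u, v in combinations(nodes, 2)
            if v in adj[u]
        ))
        if best is None or edges < best:
            best = edges
    return best


def count_subgraph_classes(edges, k):
    """Count connected induced k-node subgraphs per isomorphism class."""
    # Build an undirected adjacency map from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = Counter()
    # Exhaustive enumeration: every k-subset of nodes is examined.
    for nodes in combinations(sorted(adj), k):
        if is_connected(nodes, adj):
            counts[canonical_form(nodes, adj)] += 1
    return counts


if __name__ == "__main__":
    # A triangle (0, 1, 2) with one pendant node 3 attached to node 2.
    graph = [(0, 1), (1, 2), (0, 2), (2, 3)]
    for form, freq in count_subgraph_classes(graph, 3).items():
        print(form, freq)
```

On this four-node example with k = 3, the sketch finds one triangle and two 3-node paths. Both the number of k-subsets and the k! relabelings per subset grow very quickly, which is the cost the abstract says makes exhaustive counting feasible only for small networks and small k.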
