A distributed approach for graph mining in massive networks

We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to potential false negatives. Furthermore, for scalable performance it is crucial to minimize the communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.

[1]  Siegfried Nijssen,et al.  What Is Frequent in a Single Graph? , 2007, PAKDD.

[2]  Thorsten Meinl,et al.  Mining Molecular Datasets on Symmetric Multiprocessor Systems , 2006, 2006 IEEE International Conference on Systems, Man and Cybernetics.

[3]  Jianzhong Li,et al.  Efficient Subgraph Matching on Billion Node Graphs , 2012, Proc. VLDB Endow..

[4]  George Karypis,et al.  A Multi-Level Parallel Implementation of a Program for Finding Frequent Patterns in a Large Sparse Graph , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[5]  Huajun Chen,et al.  MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network , 2009, APPT.

[6]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..

[7]  Panos Kalnis,et al.  GRAMI: Frequent Subgraph and Pattern Mining in a Single Large Graph , 2014, Proc. VLDB Endow..

[8]  Srinivasan Parthasarathy,et al.  Improving Functional Modularity in Protein-Protein Interactions Graphs Using Hub-Induced Subgraphs , 2006, PKDD.

[9]  Jiong Yang,et al.  SPIN: mining maximal frequent subgraphs from graph databases , 2004, KDD.

[10]  Lawrence B. Holder,et al.  Discovery of Inexact Concepts from Structural Data , 1993, IEEE Trans. Knowl. Data Eng..

[11]  Srinivasan Parthasarathy,et al.  Adaptive Parallel Graph Mining for CMP Architectures , 2006, Sixth International Conference on Data Mining (ICDM'06).

[12]  Saeed Jalili,et al.  Distributed discovery of frequent subgraphs of a network using MapReduce , 2015, Computing.

[13]  Rajshekhar Sunderraman,et al.  An iterative MapReduce approach to frequent subgraph mining in biological datasets , 2012, BCB.

[14]  Lin Ma,et al.  Parallel subgraph listing in a large-scale graph , 2014, SIGMOD Conference.

[15]  Jeffrey D. Ullman,et al.  Enumerating subgraph instances using map-reduce , 2012, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[16]  Takashi Washio,et al.  An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data , 2000, PKDD.

[17]  Giuseppe Di Fatta,et al.  Dynamic Load Balancing for the Distributed Mining of Molecular Structures , 2006, IEEE Transactions on Parallel and Distributed Systems.

[18]  Mohammed J. Zaki,et al.  Arabesque: a system for distributed graph mining , 2015, SOSP.

[19]  Anthony K. H. Tung,et al.  Efficiently extracting frequent subgraphs using MapReduce , 2013, 2013 IEEE International Conference on Big Data.

[20]  Guizhen Yang,et al.  The complexity of mining maximal frequent itemsets and maximal frequent patterns , 2004, KDD.

[21]  Phokion G. Kolaitis,et al.  The complexity of mining maximal frequent subgraphs , 2013, PODS '13.

[22]  George Karypis,et al.  Finding Frequent Patterns in a Large Sparse Graph* , 2005, Data Mining and Knowledge Discovery.

[23]  Wei Wang,et al.  Efficient mining of frequent subgraphs in the presence of isomorphism , 2003, Third IEEE International Conference on Data Mining.

[24]  Xiaokui Xiao,et al.  Large-scale frequent subgraph mining in MapReduce , 2014, 2014 IEEE 30th International Conference on Data Engineering.

[25]  Mohammed J. Zaki,et al.  Parallel Graph Mining with GPUs , 2014, BigMine.

[26]  George Karypis,et al.  Finding Frequent Patterns in a Large Sparse Graph* , 2004, IEEE International Parallel and Distributed Processing Symposium.

[27]  Mohammad Al Hasan,et al.  An Iterative MapReduce Based Frequent Subgraph Mining Algorithm , 2013, IEEE Transactions on Knowledge and Data Engineering.

[28]  George Karypis,et al.  Frequent subgraph discovery , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[29]  Jiawei Han,et al.  gSpan: graph-based substructure pattern mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[30]  Bin Wu,et al.  An Efficient Distributed Subgraph Mining Algorithm in Extreme Large Graphs , 2010, AICI.