SAHAD: Subgraph Analysis in Massive Networks Using Hadoop

Relational subgraph analysis, e.g., finding labeled subgraphs in a network that are isomorphic to a template, is a key problem in many graph-related applications. It is computationally challenging for large networks and complex templates. In this paper, we develop SAHAD, an algorithm for relational subgraph analysis using Hadoop, in which the subgraph is in the form of a tree. SAHAD can solve a variety of problems closely related to subgraph isomorphism, including counting labeled and unlabeled subgraphs, finding supervised motifs, and computing graphlet frequency distributions. We prove that the worst-case work complexity of SAHAD is asymptotically very close to that of the best sequential algorithm. On a mid-size cluster with about 40 compute nodes, SAHAD scales to networks with up to 9 million nodes and a quarter billion edges, and to templates with up to 12 nodes. To the best of our knowledge, SAHAD is the first such Hadoop-based subgraph/subtree analysis algorithm, and it performs significantly better than prior approaches for very large graphs and templates. Another unique aspect is that SAHAD runs readily on Amazon EC2 without the need for any system-level optimization.
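To make the counting problem concrete, the following is a minimal brute-force sketch in Python of what "counting unlabeled subgraphs isomorphic to a tree template" means on a toy graph. It is not SAHAD and does not use Hadoop; the function names (`adjacency`, `count_injective_maps`, `count_subgraph_copies`) and the edge-list input format are assumptions made for illustration only. The sketch enumerates every injective placement of the template's nodes, which is exponential in the template size and therefore only feasible for very small inputs.

```python
from itertools import permutations


def adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_injective_maps(template_edges, k, graph_adj):
    """Count injective maps from template nodes 0..k-1 to graph nodes
    that send every template edge to a graph edge."""
    count = 0
    for image in permutations(graph_adj, k):  # image[i] is where template node i lands
        if all(image[v] in graph_adj[image[u]] for u, v in template_edges):
            count += 1
    return count


def count_subgraph_copies(template_edges, k, graph_edges):
    """Count non-induced copies of the template in the graph.

    Each copy is hit once per automorphism of the template, so the number of
    injective maps is divided by the template's automorphism count.
    """
    maps = count_injective_maps(template_edges, k, adjacency(graph_edges))
    automorphisms = count_injective_maps(template_edges, k, adjacency(template_edges))
    return maps // automorphisms


# Template: a path on 3 nodes (a tree). Graph: a 4-cycle.
path3 = [(0, 1), (1, 2)]
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(count_subgraph_copies(path3, 3, cycle4))
```

The brute-force cost grows roughly as n^k for a graph with n nodes and a template with k nodes, which illustrates why networks with millions of nodes and 12-node templates call for a far more scalable approach.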
