Efficient techniques for subgraph mining and query processing

Graph data has become so prevalent that efficiently extracting useful information from it is in high demand. Given massive amounts of graph data, users are often interested in only a small portion of it, namely subgraphs, which they obtain through mining and querying. Because massive graph data contains an enormous number of subgraphs, these processes are very costly. In this thesis, we study three important problems in subgraph mining and query processing: frequent subgraph mining, network motif discovery, and generalized subgraph query processing. These problems have numerous real-world applications, yet they are extremely challenging. First, mining frequent subgraphs from a large collection of graph objects is an important problem in several application domains, such as bioinformatics, social networks, and computer vision. The main challenge in subgraph mining is efficiency, as (i) testing for graph isomorphism is computationally intensive, and (ii) the cardinality of the graph collection to be mined may be very large. We propose a two-step filter-and-refinement approach that is suited to massive parallelization within the scalable MapReduce computing model. We partition the collection of graphs among worker nodes, and each worker applies the filter step to determine a set of candidate subgraphs that are locally frequent in its partition. The union of all such candidates is the input to the refinement step, where each candidate is checked against all partitions and only the globally frequent subgraphs are retained. We devise a statistical threshold mechanism that predicts which subgraphs have a high chance of becoming globally frequent, and thus reduces the computational overhead of the refinement step. We also propose effective strategies to avoid redundant computation in each round when searching for candidate graphs, as well as a lightweight graph compression mechanism to reduce the communication cost between machines.
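
The filter-and-refinement idea can be illustrated with a minimal single-process sketch. It is not the thesis implementation and rests on several simplifying assumptions: each graph is reduced to a set of labeled edges, a "pattern" is a small set of such edges, and containment is plain set inclusion rather than true subgraph isomorphism. The statistical threshold mechanism, the compression scheme, and MapReduce itself are omitted; the filter step instead uses the same relative threshold locally as globally, which guarantees that every globally frequent pattern is locally frequent in at least one partition. All function names (`local_candidates`, `global_support`, `filter_and_refine`) are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import ceil


def local_candidates(partition, min_ratio, max_size=2):
    """Filter step: patterns that are frequent within one partition.

    Assumption: a graph is a frozenset of labeled edges and a pattern is
    a frozenset of at most `max_size` edges (no isomorphism testing).
    """
    threshold = ceil(min_ratio * len(partition))
    counts = Counter()
    for graph in partition:
        for size in range(1, max_size + 1):
            for edges in combinations(sorted(graph), size):
                counts[frozenset(edges)] += 1
    return {pattern for pattern, c in counts.items() if c >= threshold}


def global_support(partitions, pattern):
    """Number of graphs, over all partitions, that contain the pattern."""
    return sum(1 for part in partitions for graph in part if pattern <= graph)


def filter_and_refine(partitions, min_ratio, max_size=2):
    """Union the local candidates, then keep only globally frequent ones."""
    candidates = set()
    for part in partitions:  # each iteration stands in for one worker
        candidates |= local_candidates(part, min_ratio, max_size)

    total_graphs = sum(len(part) for part in partitions)
    threshold = ceil(min_ratio * total_graphs)

    frequent = {}
    for pattern in candidates:  # refinement: check against all partitions
        support = global_support(partitions, pattern)
        if support >= threshold:
            frequent[pattern] = support
    return frequent


# Toy collection: four "molecule-like" graphs split over two partitions.
g1 = frozenset({("C", "O"), ("C", "N")})
g2 = frozenset({("C", "O"), ("C", "C")})
g3 = frozenset({("C", "O"), ("C", "N")})
g4 = frozenset({("C", "N"), ("N", "O")})
partitions = [[g1, g2], [g3, g4]]

result = filter_and_refine(partitions, min_ratio=0.75)
```

In this toy run, each partition contributes one single-edge candidate, and both survive refinement with a global support of 3 out of 4 graphs. A real system would replace the set-inclusion check with a subgraph isomorphism test and run the two steps as separate distributed rounds.
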
Extensive experimental evaluation on several large real-world graph datasets shows that the proposed approach clearly outperforms the existing state of the art and provides a practical solution to the problem of frequent subgraph mining for massive collections of graphs. Second, the identification of network motifs has essential applications in numerous domains, such as pattern detection in biological networks and graph analysis in digital circuits. However, mining network motifs is computationally challenging, as it requires enumerating subgraphs in a real-life graph and computing the frequency of each subgraph in a large number of random graphs. In particular, existing solutions often require days
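
To make the cost of motif discovery concrete, the following sketch compares the frequency of one subgraph in an input graph with its frequency in randomized graphs. It is only an illustration under stated assumptions, not the method of the thesis: the only subgraph considered is the undirected triangle, random graphs are produced by degree-preserving edge switching, and significance is summarized as a z-score. The names `count_triangles`, `randomize`, `degree_sequence`, and `motif_z_score` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def count_triangles(edges):
    """Count triangles in a simple undirected graph given as unique edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is seen once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def degree_sequence(edges):
    """Map each vertex to its degree."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


def randomize(edges, num_swaps, rng):
    """Degree-preserving randomization by repeated edge switching."""
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    for _ in range(num_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c
        if a == d or c == b:  # the switch would create a self-loop
            continue
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if e1 in edge_set or e2 in edge_set:  # would create a multi-edge
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {e1, e2}
        edge_list[i], edge_list[j] = e1, e2
    return edge_list


def motif_z_score(edges, num_random=100, swaps_per_edge=10, seed=0):
    """Return (observed count, mean random count, z-score or None)."""
    rng = random.Random(seed)
    observed = count_triangles(edges)
    samples = [
        count_triangles(randomize(edges, swaps_per_edge * len(edges), rng))
        for _ in range(num_random)
    ]
    mu, sigma = mean(samples), pstdev(samples)
    z = (observed - mu) / sigma if sigma > 0 else None
    return observed, mu, z


# Toy graph: two triangles sharing vertex 2, plus a short tail.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4), (4, 5), (5, 6)]
observed, mu, z = motif_z_score(edges, num_random=50)
```

Even in this reduced form, the subgraph count is repeated once per random graph, which hints at why exact motif discovery over many subgraph types and many random graphs becomes expensive on large inputs.
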
