Mining Frequent Patterns in Evolving Graphs

Given a labeled graph, the frequent-subgraph mining (FSM) problem asks for all k-vertex subgraphs that appear with frequency greater than a given threshold. FSM has numerous applications ranging from biology to network science, as it provides a compact summary of the characteristics of the graph. However, the task is challenging, and even more so for evolving graphs, because of the streaming nature of the input and the exponential time complexity of the problem. In this paper, we initiate the study of the approximate FSM problem in both incremental and fully-dynamic streaming settings, where arbitrary edges can be added to or removed from the graph. For each streaming setting, we propose algorithms that, with high probability, extract a high-quality approximation of the frequent k-vertex subgraphs for a given threshold at any given time instance. In contrast to existing state-of-the-art solutions, which must iterate over the entire set of subgraphs on every update, our algorithms maintain a uniform sample of k-vertex subgraphs using optimized neighborhood-exploration procedures local to the updates. We provide a theoretical analysis of the proposed algorithms and empirically demonstrate that they produce high-quality results compared to baselines.
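The idea of keeping a fixed-size uniform sample over a stream, and reading approximate frequencies off that sample, can be illustrated with classic reservoir sampling. The sketch below is not the algorithm proposed in this paper: it assumes that subgraph occurrences arrive one at a time as hashable pattern labels, it handles insertions only, and the names `ReservoirSample` and `estimate_frequent` are hypothetical.

```python
import random
from collections import Counter


class ReservoirSample:
    """Fixed-size uniform sample over an insert-only stream (classic reservoir sampling)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity          # maximum number of items kept
        self.items = []                   # the current sample
        self.seen = 0                     # number of stream items observed so far
        self._rng = random.Random(seed)   # private RNG so runs can be reproduced

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            # The reservoir is not full yet: keep every item.
            self.items.append(item)
            return
        # Pick a slot uniformly from all positions seen so far; the new item
        # enters the sample with probability capacity / seen.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item


def estimate_frequent(sample, threshold):
    """Return the patterns whose relative frequency in the sample exceeds threshold."""
    n = len(sample.items)
    if n == 0:
        return {}
    counts = Counter(sample.items)
    return {p: c / n for p, c in counts.items() if c / n > threshold}


# Hypothetical usage: the labels stand in for isomorphism classes of k-vertex subgraphs.
sample = ReservoirSample(capacity=100, seed=0)
for i in range(1000):
    sample.add("wedge" if i % 4 else "triangle")
frequent = estimate_frequent(sample, threshold=0.5)
```

Supporting edge deletions, as in the fully-dynamic setting, would need a different sampling scheme, which this sketch does not attempt.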
