Counting frequent patterns in large labeled graphs: a hypergraph-based approach

In recent years, the popularity of graph databases has grown rapidly. This paper focuses on the single-graph setting as an effective model for representing information, and on the graph mining techniques related to it. Frequent pattern mining in a single graph poses two main problems: choosing a support measure and designing a search scheme. We propose a novel framework for designing support measures that brings together the existing minimum-image-based and overlap-graph-based support measures. The framework is built on the concept of occurrence/instance hypergraphs. Based on these hypergraphs, we design two new support measures, the minimum instance (MI) measure and the minimum vertex cover (MVC) measure, that combine the advantages of existing measures. More importantly, we show that the existing minimum-image-based support measure is an upper bound of the MI measure, which is also linear-time computable and yields counts that are close to the number of instances of a pattern. We show not only that most major existing support measures and the new measures proposed in this paper can be mapped into the new framework, but also that they occupy different locations on the frequency spectrum. By taking advantage of the new framework, we find that MVC can be approximated in polynomial time to within a constant factor (expressed in terms of the number of pattern nodes). Contrary to common belief, we demonstrate that the state-of-the-art overlap-graph-based maximum independent set (MIS) measure also admits constant-factor approximation algorithms. We further show that, using standard linear programming and semidefinite programming techniques, polynomial-time relaxations can be developed for both the MVC and MIS measures, and that their counts lie between those of MVC and MIS. In addition, we point out that MVC, MIS, and their relaxations are bounded within a constant factor. In summary, all major support measures are unified in the new hypergraph-based framework, which helps reveal their bounding relations and hardness properties.
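As a rough illustration of how these measures can sit at different points of the frequency spectrum, the following sketch computes three of them on a toy example. It is not the paper's implementation. It assumes the usual textbook formulations: the minimum-image-based count is the smallest number of distinct data vertices that any single pattern node maps to, MVC is the size of a smallest vertex set that hits every instance, and MIS is the largest number of pairwise vertex-disjoint instances. All function names are hypothetical, the exact solvers are brute force and only suitable for tiny inputs, and the MI measure is omitted because the abstract does not define it.

```python
from itertools import combinations

def mni_support(occurrences):
    """Minimum-image-based count: min over pattern nodes of distinct images."""
    k = len(occurrences[0])  # number of pattern nodes
    return min(len({occ[i] for occ in occurrences}) for i in range(k))

def instance_hyperedges(occurrences):
    """One hyperedge per distinct vertex set (occurrences that differ only
    by a pattern automorphism collapse into the same instance)."""
    return sorted({frozenset(occ) for occ in occurrences}, key=sorted)

def exact_mvc(edges):
    """Brute-force minimum vertex cover size of the instance hypergraph."""
    vertices = sorted(set().union(*edges))
    for size in range(len(vertices) + 1):
        for cand in combinations(vertices, size):
            chosen = set(cand)
            if all(chosen & e for e in edges):
                return size

def exact_mis(edges):
    """Brute-force maximum number of pairwise disjoint instances."""
    for size in range(len(edges), 0, -1):
        for cand in combinations(edges, size):
            if all(a.isdisjoint(b) for a, b in combinations(cand, 2)):
                return size
    return 0

def approx_cover(edges):
    """Classic k-approximation: take every vertex of each uncovered edge."""
    cover = set()
    for e in edges:
        if not (cover & e):
            cover |= e
    return cover

# Toy data (assumed): a single-edge pattern whose two nodes share a label,
# matched in a triangle a-b-c. Each tuple is one occurrence, listing the
# data vertex assigned to pattern node 0 and pattern node 1.
occurrences = [("a", "b"), ("b", "a"), ("b", "c"),
               ("c", "b"), ("a", "c"), ("c", "a")]
edges = instance_hyperedges(occurrences)
k = len(occurrences[0])

mis = exact_mis(edges)
mvc = exact_mvc(edges)
mni = mni_support(occurrences)
cover = approx_cover(edges)
print(mis, mvc, mni, len(cover))
```

On this input the three counts differ, with MIS below MVC below the minimum-image-based count. The ordering is not specific to the example: the images of any one pattern node form a vertex cover of the instances, and every cover needs a distinct vertex for each of a set of disjoint instances. The greedy cover is at most k times the optimum, where k is the number of pattern nodes, which is one simple way to obtain a factor of the kind the abstract mentions.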
