Efficient Algorithms for Mining Significant Substructures in Graphs with Quality Guarantees

Graphs have become popular for modeling scientific data in recent years. As a result, techniques for mining graphs are extremely important for understanding inherent data and domain characteristics. One such exploratory mining paradigm is the k-MST (minimum spanning tree over k vertices) problem, which can be used to discover significant local substructures. In this paper, we present an efficient approximation algorithm for the k-MST problem in large graphs. The algorithm has an O(√k) approximation ratio and O(n log n + m log m log k + nk^2 log k) running time, where n and m are the number of vertices and edges, respectively. Experimental results on synthetic graphs and protein interaction networks show that the algorithm is scalable to large graphs and useful for discovering biological pathways. The highlight of the algorithm is that it offers both analytical guarantees and empirical evidence of good running time and quality.
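To make the k-MST problem statement concrete, here is a minimal brute-force sketch in Python. It is not the approximation algorithm presented in the paper: it enumerates every k-vertex subset and runs Kruskal's algorithm on the induced subgraph, so it is exact but exponential in k and only suitable for tiny graphs. The function names (`induced_mst_weight`, `brute_force_k_mst`) and the edge format `(u, v, weight)` are assumptions made for illustration.

```python
from itertools import combinations


def induced_mst_weight(vertices, edges):
    """Weight of the MST of the subgraph induced by `vertices`.

    `edges` is an iterable of (u, v, weight) tuples (hypothetical format).
    Returns None when the induced subgraph is disconnected.
    """
    vs = set(vertices)
    parent = {v: v for v in vs}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    total, used = 0, 0
    # Kruskal: scan induced edges in order of increasing weight.
    for w, u, v in sorted((w, u, v) for u, v, w in edges if u in vs and v in vs):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
            used += 1
    # A spanning tree over |vs| vertices has exactly |vs| - 1 edges.
    return total if used == len(vs) - 1 else None


def brute_force_k_mst(nodes, edges, k):
    """Exact k-MST by exhaustive search; returns (weight, vertex_subset) or None."""
    best = None
    for subset in combinations(nodes, k):
        w = induced_mst_weight(subset, edges)
        if w is not None and (best is None or w < best[0]):
            best = (w, subset)
    return best
```

The exhaustive search is correct because the cheapest tree on a fixed vertex set is the MST of the subgraph induced by that set. Its cost grows with the number of k-subsets, which is why an approximation algorithm with a guaranteed ratio is of interest for large graphs.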
