Maximal Biclique Subgraphs and Closed Pattern Pairs of the Adjacency Matrix: A One-to-One Correspondence and Mining Algorithms

Maximal biclique (also known as complete bipartite) subgraphs can model many applications in Web mining, business, and bioinformatics. Enumerating the maximal biclique subgraphs of a graph is computationally challenging, because the size of the output can grow exponentially with the number of vertices. In this paper, we enumerate them efficiently by using closed patterns of the adjacency matrix of the graph. For an undirected graph G without self-loops, we prove that 1) the number of closed patterns in the adjacency matrix of G is even, 2) the number of closed patterns is precisely double the number of maximal biclique subgraphs of G, and 3) for every maximal biclique subgraph, there always exists a unique pair of closed patterns that matches the two vertex sets of the subgraph. Therefore, the problem of enumerating maximal bicliques can be solved with efficient algorithms for mining closed patterns, which have been studied extensively in the data mining field. However, using the existing algorithms directly enumerates every maximal biclique twice. To achieve high efficiency, we propose an O(mn) time-delay algorithm for nonduplicated enumeration, in particular for enumerating maximal bicliques of large size, where m and n are the number of edges and vertices of the graph, respectively. We evaluate the efficiency of our algorithm by comparing it to state-of-the-art algorithms on three categories of graphs: randomly generated graphs, benchmarks, and a real-life protein interaction network. We also prove that if self-loops are allowed in a graph, then the number of closed patterns in the adjacency matrix is not necessarily even, but the maximal bicliques are exactly the same as those of the graph after removing all the self-loops.
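The correspondence stated above can be checked by brute force on a small graph. The following Python sketch is only an illustration and is not the O(mn) time-delay algorithm proposed in the paper. It assumes that each row of the adjacency matrix is read as the neighbor set of a vertex, that a pattern is closed when it equals the set of common neighbors of its own common neighbors, and that only nonempty patterns with a nonempty occurrence set are counted. The function names and the example graph are hypothetical.

```python
from itertools import combinations


def common_neighbors(adj, vertices):
    """Return the vertices adjacent to every vertex in `vertices`."""
    result = set(adj)
    for v in vertices:
        result &= adj[v]
    return frozenset(result)


def closed_patterns(adj):
    """Brute-force all closed patterns (exponential; small graphs only).

    Assumption: a pattern X counts only if it is nonempty, its occurrence
    set (the common neighbors of X) is nonempty, and the common neighbors
    of that occurrence set give back exactly X.
    """
    closed = set()
    nodes = sorted(adj)
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            x = frozenset(subset)
            occ = common_neighbors(adj, x)
            if occ and common_neighbors(adj, occ) == x:
                closed.add(x)
    return closed


def maximal_bicliques(adj):
    """Pair each closed pattern with its occurrence set.

    Each biclique is stored as an unordered pair, so the two closed
    patterns that describe the same biclique collapse into one entry.
    """
    return {
        frozenset({x, common_neighbors(adj, x)})
        for x in closed_patterns(adj)
    }


# Hypothetical example: the 4-cycle 1-2-3-4 plus a pendant edge 1-5.
# The graph is undirected and has no self-loops.
adj = {
    1: {2, 4, 5},
    2: {1, 3},
    3: {2, 4},
    4: {1, 3},
    5: {1},
}

patterns = closed_patterns(adj)
bicliques = maximal_bicliques(adj)
print(len(patterns), len(bicliques))  # prints: 4 2
```

On this graph the closed patterns are {1}, {2, 4, 5}, {1, 3}, and {2, 4}. They pair up into the two maximal bicliques ({1}, {2, 4, 5}) and ({1, 3}, {2, 4}), so the number of closed patterns is even and exactly twice the number of maximal bicliques. Storing each biclique as an unordered pair is one simple way to avoid the duplicated enumeration that arises when a closed pattern miner is used directly.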
