Coupling graph perturbation theory with scalable parallel algorithms for large-scale enumeration of maximal cliques in biological graphs

Data-driven construction of predictive models for biological systems faces challenges from data intensity, uncertainty, and computational complexity. Data-driven model inference is often cast as a combinatorial graph problem in which all feasible models must be enumerated. However, the data-intensive and NP-hard nature of such problems makes it difficult for existing methods to handle the required data sizes and levels of uncertainty, even on modern supercomputers. Maximal clique enumeration (MCE) in a graph derived from such biological data is often a rate-limiting step in detecting protein complexes in protein interaction data, finding clusters of co-expressed genes in microarray data, or identifying clusters of orthologous genes in protein sequence data. We report two key advances that address this challenge. First, we designed and implemented what is, to the best of our knowledge, the first parallel MCE algorithm that scales linearly on thousands of processors when run on real-world biological networks with thousands and hundreds of thousands of vertices. Second, we proposed and developed Graph Perturbation Theory (GPT), which establishes a foundation for efficiently solving the MCE problem in perturbed graphs, which model the uncertainty in the data. GPT formulates necessary and sufficient conditions for detecting the differences between the sets of maximal cliques in the original and perturbed graphs, and it reduces enumeration time by more than 80% compared to complete recomputation.
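
To make the MCE problem and the notion of a perturbed graph concrete, the following is a minimal serial sketch, not the parallel algorithm or the GPT method reported here. It assumes the classic Bron-Kerbosch backtracking scheme with pivoting and shows the baseline that GPT is compared against: complete recomputation of the maximal cliques after an edge is removed, followed by a set difference. The function names (`build_adjacency`, `maximal_cliques`) and the toy edge list are hypothetical and chosen only for illustration.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Yield every maximal clique of the graph as a frozenset.

    Serial Bron-Kerbosch with pivoting (an assumed textbook variant).
    r: current clique, p: candidates that can extend r,
    x: vertices already explored that must be excluded.
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)  # r cannot be extended: it is maximal
            return
        # Pick the pivot with the most neighbours among the candidates,
        # so that fewer branches need to be explored.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Toy graph: a triangle 1-2-3 with a pendant edge 3-4.
original_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
# Perturbation: edge (2, 3) is removed, e.g. as a suspected false positive.
perturbed_edges = [(1, 2), (1, 3), (3, 4)]

before = set(maximal_cliques(build_adjacency(original_edges)))
after = set(maximal_cliques(build_adjacency(perturbed_edges)))

lost = before - after    # maximal cliques destroyed by the perturbation
gained = after - before  # maximal cliques created by the perturbation
print("lost:", sorted(sorted(c) for c in lost))
print("gained:", sorted(sorted(c) for c in gained))
```

In this toy case only the cliques touching the removed edge change, while `{3, 4}` is unaffected. Avoiding re-enumeration of such unaffected cliques is the kind of saving that a perturbation-aware approach aims for, whereas the sketch above recomputes everything from scratch.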
