The maximum clique enumeration problem: algorithms, applications, and implementations

Background
The maximum clique enumeration (MCE) problem asks that we identify all maximum cliques in a finite, simple graph. MCE is closely related to two other well-known and widely studied problems: the maximum clique optimization problem, which asks us to determine the size of a largest clique, and the maximal clique enumeration problem, which asks that we compile a listing of all maximal cliques. Naturally, these three problems are NP-hard, given that they subsume the classic version of the NP-complete clique decision problem. MCE can be solved in principle with standard enumeration methods due to Bron, Kerbosch, Kose and others. Unfortunately, these techniques are ill-suited to graphs encountered in our applications. We must solve MCE on instances rooted in data mining and computational biology, where high-throughput data capture often creates graphs of extreme size and density. MCE can also be solved in principle using more modern algorithms based in part on vertex cover and the theory of fixed-parameter tractability (FPT). While FPT is an improvement, these algorithms too can fail to scale sufficiently well as the sizes and densities of our datasets grow.

Results
An extensive testbed of benchmark graphs is created using publicly available transcriptomic datasets from the Gene Expression Omnibus (GEO). Empirical testing reveals crucial but latent features of such high-throughput biological data. In turn, it is shown that these features distinguish real data from random data intended to reproduce salient topological features. In particular, with real data there tends to be an unusually high degree of maximum clique overlap. Armed with this knowledge, novel decomposition strategies are tuned to the data and coupled with the best FPT MCE implementations.

Conclusions
Several algorithmic improvements to MCE are made that progressively decrease the run time on graphs in the testbed. Frequently the final run-time improvement is several orders of magnitude. As a result, instances that were once prohibitively time-consuming to solve are brought into the domain of realistic feasibility.
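
To make the distinction between maximal and maximum cliques concrete, the following is a minimal Python sketch of the classical baseline mentioned in the Background: Bron-Kerbosch enumeration of maximal cliques with a simple pivoting rule, followed by a filter that keeps only the largest ones. It is not the FPT or decomposition-based method described in this work. The function names, the adjacency-set representation, and the toy graph are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterator, List, Set

# Assumed representation: vertex -> set of neighbours (undirected, simple graph).
Graph = Dict[int, Set[int]]


def maximal_cliques(adj: Graph) -> Iterator[FrozenSet[int]]:
    """Yield every maximal clique (Bron-Kerbosch with pivoting)."""

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> Iterator[FrozenSet[int]]:
        # r: current clique, p: candidates that extend r, x: already explored.
        if not p and not x:
            yield frozenset(r)  # r cannot be extended, so it is maximal
            return
        # Pivot on the vertex with the most neighbours among the candidates;
        # only non-neighbours of the pivot need to be branched on.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    if adj:
        yield from expand(set(), set(adj), set())


def maximum_cliques(adj: Graph) -> List[FrozenSet[int]]:
    """Return all cliques of largest size (the MCE answer), via brute filtering."""
    best: List[FrozenSet[int]] = []
    for clique in maximal_cliques(adj):
        if not best or len(clique) > len(best[0]):
            best = [clique]
        elif len(clique) == len(best[0]):
            best.append(clique)
    return best


if __name__ == "__main__":
    # Toy graph: two triangles sharing the edge 2-3, plus a pendant edge 4-5.
    toy: Graph = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
    print(sorted(sorted(c) for c in maximal_cliques(toy)))
    print(sorted(sorted(c) for c in maximum_cliques(toy)))
```

On the toy graph there are three maximal cliques ({1, 2, 3}, {2, 3, 4} and {4, 5}) but only the two triangles are maximum, and they overlap in the vertices 2 and 3. Enumerating every maximal clique and then filtering is exactly the kind of approach that may become impractical on large, dense graphs, which is the motivation given above for more specialized methods.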
