A fast weak motif-finding algorithm based on community detection in graphs

Background

Identification of transcription factor binding sites (also called "motif discovery") in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from solved because of the diversity of gene expression and regulation and the low specificity of binding sites. State-of-the-art algorithms each have constraints that limit their scope of application: high time or space complexity when finding long motifs, low precision when identifying weak motifs, or the OOPS constraint (one occurrence of the motif instance per sequence).

Results

In this paper, we present a novel and fast algorithm called TFBSGroup. It is based on community detection in a graph and is used to discover long and weak (l, d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. First, TFBSGroup transforms the (l, d) motif search in sequences into the discovery of dense subgraphs within a graph. It identifies these subgraphs with a fast community detection method, which yields coarse-grained candidate motifs. Next, it greedily refines each candidate motif towards the true motif within its own community. Empirical studies on synthetic (l, d) samples show that TFBSGroup is very efficient (e.g., it can find true (18, 6) and (24, 8) motifs within 30 seconds). More importantly, the algorithm rapidly identified motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. It also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal.

Conclusions

Our novel heuristic algorithm, TFBSGroup, quickly identifies nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.
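
The three stages named in the abstract (graph construction, dense-subgraph detection, greedy refinement) can be illustrated with a minimal Python sketch. It is not the TFBSGroup implementation. Several choices are assumptions made for illustration: the edge rule (two l-mers are linked when their Hamming distance is at most 2d, the largest distance possible between two instances that each lie within d mutations of one motif), the use of the highest-degree vertex and its neighbours as a crude stand-in for a real community detection method, and all function names.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def windows(sequences, l):
    """List every l-mer as (sequence index, offset, word)."""
    return [(i, j, s[j:j + l])
            for i, s in enumerate(sequences)
            for j in range(len(s) - l + 1)]


def build_graph(nodes, l, d):
    """Link two l-mers when they differ in at most 2d positions (assumed rule).

    Overlapping windows taken from the same sequence are not linked.
    """
    adj = {k: set() for k in range(len(nodes))}
    for a, b in combinations(range(len(nodes)), 2):
        (sa, oa, wa), (sb, ob, wb) = nodes[a], nodes[b]
        if sa == sb and abs(oa - ob) < l:
            continue
        if hamming(wa, wb) <= 2 * d:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def densest_neighbourhood(adj):
    """Stand-in for community detection: highest-degree vertex plus neighbours."""
    hub = max(adj, key=lambda k: len(adj[k]))
    return {hub} | adj[hub]


def consensus(words):
    """Column-wise majority vote over equal-length words."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*words))


def sites_of(nodes, motif, d):
    """All windows within d mutations of the motif (zero, one or many per sequence)."""
    return [(i, j, w) for i, j, w in nodes if hamming(w, motif) <= d]


def refine(nodes, seed_words, d, max_iter=20):
    """Greedy refinement: alternate consensus and site reassignment until stable."""
    motif = consensus(seed_words)
    for _ in range(max_iter):
        hits = sites_of(nodes, motif, d)
        if not hits:
            break
        updated = consensus([w for _, _, w in hits])
        if updated == motif:
            break
        motif = updated
    return motif, sites_of(nodes, motif, d)


def find_motif(sequences, l, d):
    """Run the three illustrative stages on a non-empty set of sequences."""
    nodes = windows(sequences, l)
    adj = build_graph(nodes, l, d)
    community = densest_neighbourhood(adj)
    seed_words = [nodes[k][2] for k in sorted(community)]
    return refine(nodes, seed_words, d)


if __name__ == "__main__":
    # Toy input: three noisy copies of one 4-mer and one unrelated sequence.
    motif, sites = find_motif(["ACGT", "ACGA", "TCGT", "GGGG"], l=4, d=1)
    print(motif, [(i, j) for i, j, _ in sites])
```

Because sites are collected with a plain distance test rather than one pick per sequence, a sequence may contribute no site, one site or several, which mirrors the ZOMOPS setting. The all-pairs graph construction is quadratic in the number of l-mers and is only meant for small examples.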
