PhyloScan: identification of transcription factor binding sites using cross-species evidence

Background: When transcription factor binding sites are known for a particular transcription factor, it is possible to construct a motif model that can be used to scan sequences for additional sites. However, few statistically significant sites are revealed when a transcription factor binding site motif model is used to scan a genome-scale database.

Methods: We have developed a scanning algorithm, PhyloScan, which combines evidence from matching sites found in orthologous data from several related species with evidence from multiple sites within an intergenic region, to better detect regulons. The orthologous sequence data may be multiply aligned, unaligned, or a combination of aligned and unaligned. In aligned data, PhyloScan statistically accounts for the phylogenetic dependence of the species contributing data to the alignment and, in unaligned data, the evidence for sites is combined assuming phylogenetic independence of the species. The statistical significance of the gene predictions is calculated directly, without employing training sets.

Results: In a test of our methodology on synthetic data modeled on seven Enterobacteriales, four Vibrionales, and three Pasteurellales species, PhyloScan produces better sensitivity and specificity than MONKEY, an advanced scanning approach that also searches a genome for transcription factor binding sites using phylogenetic information. The application of the algorithm to real sequence data from seven Enterobacteriales species identifies novel Crp and PurR transcription factor binding sites, thus providing several new potential sites for these transcription factors. These sites enable targeted experimental validation and thus further delineation of the Crp and PurR regulons in E. coli.

Conclusion: Better sensitivity and specificity can be achieved through a combination of (1) using mixed alignable and non-alignable sequence data and (2) combining evidence from multiple sites within an intergenic region.
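To make the starting point of the Background concrete, the following is a minimal sketch of single-sequence motif scanning: a log-odds position weight matrix is built from known sites and slid along a sequence. It is not PhyloScan's implementation and covers none of its cross-species or multi-site statistics. The function names, the uniform background, the pseudocount, and the toy sites are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"

def build_log_odds(sites, background=None, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length known sites.

    Assumes a uniform background unless one is supplied (illustrative choice).
    """
    background = background or {base: 0.25 for base in ALPHABET}
    width = len(sites[0])
    matrix = []
    for pos in range(width):
        # Start every count at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({
            base: math.log2((counts[base] / total) / background[base])
            for base in ALPHABET
        })
    return matrix

def scan(sequence, matrix):
    """Return (start, score) for every window of the sequence (forward strand only)."""
    width = len(matrix)
    return [
        (start, sum(matrix[j][sequence[start + j]] for j in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

def best_hit(sequence, matrix):
    """Return the highest-scoring (start, score) window."""
    return max(scan(sequence, matrix), key=lambda hit: hit[1])

# Toy usage with made-up sites (hypothetical data, not from the paper).
toy_sites = ["TGTGA", "TGTGA", "TGTCA"]
toy_matrix = build_log_odds(toy_sites)
toy_start, toy_score = best_hit("AAATGTGACCC", toy_matrix)
```

A scan like this yields one score per window; judging which of those scores are significant genome-wide is the problem the abstract says PhyloScan addresses by pooling evidence across species and across sites within an intergenic region.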
