Assessing phylogenetic motif models for predicting transcription factor binding sites

Motivation: A variety of algorithms have been developed to predict transcription factor binding sites (TFBSs) within the genome by exploiting the evolutionary information implicit in multiple alignments of the genomes of related species. One such approach uses an extension of the standard position-specific motif model that incorporates phylogenetic information via a phylogenetic tree and a model of evolution. However, these phylogenetic motif models (PMMs) have never been rigorously benchmarked in order to determine whether they lead to better prediction of TFBSs than obtained using simple position weight matrix scanning. Results: We evaluate three PMM-based prediction algorithms, each of which uses a different treatment of gapped alignments, and we compare their prediction accuracy with that of a non-phylogenetic motif scanning approach. Surprisingly, all of these algorithms appear to be inferior to simple motif scanning, when accuracy is measured using a gold standard of validated yeast TFBSs. However, the PMM scanners perform much better than simple motif scanning when we abandon the gold standard and consider the number of statistically significant sites predicted, using column-shuffled ‘random’ motifs to measure significance. These results suggest that the common practice of measuring the accuracy of binding site predictors using collections of known sites may be dangerously misleading since such collections may be missing ‘weak’ sites, which are exactly the type of sites needed to discriminate among predictors. We then extend our previous theoretical model of the statistical power of PMM-based prediction algorithms to allow for loss of binding sites during evolution, and show that it gives a more accurate upper bound on scanner accuracy. Finally, utilizing our theoretical model, we introduce a new method for predicting the number of real binding sites in a genome. The results suggest that the number of true sites for a yeast TF is in general several times greater than the number of known sites listed in the Saccharomyces cerevisiae Database (SCPD). Among the three scanning algorithms that we test, the MONKEY algorithm has the highest accuracy for predicting yeast TFBSs. Contact: j.hawkins@imb.uq.edu.au

[1]  Gary D. Stormo,et al.  DNA binding sites: representation and discovery , 2000, Bioinform..

[2]  B. Birren,et al.  Sequencing and comparison of yeast species to identify genes and regulatory elements , 2003, Nature.

[3]  H. Kishino,et al.  Dating of the human-ape splitting by a molecular clock of mitochondrial DNA , 2005, Journal of Molecular Evolution.

[4]  Jon D. McAuliffe,et al.  Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of the Human Genome , 2003, Science.

[5]  Rodrigo Lopez,et al.  Multiple sequence alignment with the Clustal series of programs , 2003, Nucleic Acids Res..

[6]  Shigeyuki Yokoyama,et al.  An unnatural base pair system for efficient PCR amplification and functionalization of DNA molecules , 2008, Nucleic acids research.

[7]  Giacomo Finocchiaro,et al.  Myc-binding-site recognition in the human genome is determined by chromatin context , 2006, Nature Cell Biology.

[8]  D. Haussler,et al.  Aligning multiple genomic sequences with the threaded blockset aligner. , 2004, Genome research.

[9]  Alexander J. Hartemink,et al.  A Nucleosome-Guided Map of Transcription Factor Binding Sites in Yeast , 2007, PLoS Comput. Biol..

[10]  K. Katoh,et al.  MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. , 2002, Nucleic acids research.

[11]  A. Halpern,et al.  Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. , 1998, Molecular biology and evolution.

[12]  G. Rubin,et al.  Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[13]  A. Sandelin,et al.  Applied bioinformatics for the identification of regulatory elements , 2004, Nature Reviews Genetics.

[14]  John Hawkins,et al.  The Statistical Power of Phylogenetic Motif Models , 2008, RECOMB.

[15]  T. Bailey,et al.  High-throughput chromatin information enables accurate tissue-specific prediction of transcription factor binding sites , 2008, Nucleic acids research.

[16]  Mark Gerstein,et al.  Divergence of transcription factor binding sites across related yeast species. , 2007, Science.

[17]  J. Felsenstein Evolutionary trees from DNA sequences: A maximum likelihood approach , 2005, Journal of Molecular Evolution.

[18]  Chuong B. Do,et al.  Access the most recent version at doi: 10.1101/gr.926603 References , 2003 .

[19]  Manolis Kellis,et al.  Reliable prediction of regulator targets using 12 Drosophila genomes. , 2007, Genome research.

[20]  David A. Nix,et al.  Large-Scale Turnover of Functional Transcription Factor Binding Sites in Drosophila , 2006, PLoS Comput. Biol..

[21]  Michael Q. Zhang,et al.  SCPD: a promoter database of the yeast Saccharomyces cerevisiae , 1999, Bioinform..

[22]  R. Staden Searching for patterns in protein and nucleic acid sequences. , 1990, Methods in enzymology.

[23]  E. Siggia,et al.  Analysis of Combinatorial cis-Regulation in Synthetic and Genomic Promoters , 2008, Nature.

[24]  Neil D Clarke,et al.  Whole-genome comparison of Leu3 binding in vitro and in vivo reveals the importance of nucleosome occupancy in target site selection. , 2006, Genome research.

[25]  R. Tjian,et al.  Transcription regulation and animal diversity , 2003, Nature.

[26]  Hao Li,et al.  The Evolution of Combinatorial Gene Regulation in Fungi , 2008, PLoS biology.

[27]  T. Kouzarides Chromatin Modifications and Their Function , 2007, Cell.

[28]  Scott Doniger,et al.  Frequent Gain and Loss of Functional Transcription Factor Binding Sites , 2007, PLoS Comput. Biol..

[29]  John D. Storey A direct approach to false discovery rates , 2002 .

[30]  D. Guhathakurta,et al.  Computational identification of transcriptional regulatory elements in DNA sequence , 2006, Nucleic acids research.

[31]  Alan M. Moses,et al.  MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model , 2004, Genome Biology.

[32]  S. Eddy A Model of the Statistical Power of Comparative Genome Sequence Analysis , 2005, PLoS biology.

[33]  J A Swets,et al.  Measuring the accuracy of diagnostic systems. , 1988, Science.

[34]  T A Gray,et al.  Phylogenetic footprinting reveals a nuclear protein which binds to silencer sequences in the human gamma and epsilon globin genes , 1992, Molecular and cellular biology.