Use of ChIP-Seq data for the design of a multiple promoter-alignment method

We address the challenge of regulatory sequence alignment with a new method, Pro-Coffee, a multiple aligner specifically designed for homologous promoter regions. Pro-Coffee uses a dinucleotide substitution matrix estimated on alignments of functional binding sites from TRANSFAC. We designed a validation framework using several thousand families of orthologous promoters. This data set was used to evaluate how accurately each method identifies true human orthologs among their paralogs. Whereas other methods achieve on average 73.5% accuracy with default settings, and 77.6% when trained on that same data set, Pro-Coffee reaches 80.4%. We then applied a novel validation procedure based on multi-species ChIP-seq data. Trained and untrained methods were tested for their capacity to correctly align experimentally detected binding sites. Whereas the average number of correctly aligned sites for two transcription factors is 284 for default methods and 316 for trained methods, Pro-Coffee achieves 331, which is 16.5% above the default average. We find a high correlation between a method's performance when classifying orthologs and its ability to correctly align proven binding sites. This correlation is of biological interest in its own right, and it also allows us to conclude that any method trained on the ortholog data set will produce functionally more informative alignments.
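To illustrate what estimating a dinucleotide substitution matrix from aligned binding sites can look like, the following is a minimal sketch. It assumes a BLOSUM-style log-odds construction over overlapping dinucleotides in ungapped, equal-length sites. The abstract does not specify the estimation procedure, so the function name `dinucleotide_log_odds`, the pseudocount, and the toy input sites are all hypothetical and are not taken from Pro-Coffee.

```python
import math
from collections import Counter
from itertools import combinations

# All 16 dinucleotides over the DNA alphabet.
DINUCLEOTIDES = [a + b for a in "ACGT" for b in "ACGT"]


def dinucleotide_log_odds(aligned_sites, pseudocount=1.0):
    """Estimate log-odds substitution scores between dinucleotides.

    aligned_sites: ungapped DNA strings of equal length, assumed to be
    column-aligned instances of one binding site (hypothetical input).
    Returns a dict mapping (dinucleotide, dinucleotide) to a log2 score.
    """
    valid = set(DINUCLEOTIDES)
    pair_counts = Counter()
    # Count dinucleotide pairs seen at the same position in every pair of sites.
    for s, t in combinations(aligned_sites, 2):
        for i in range(min(len(s), len(t)) - 1):
            x, y = s[i:i + 2], t[i:i + 2]
            if x in valid and y in valid:
                # Add both orderings so the counts are symmetric.
                pair_counts[(x, y)] += 1
                pair_counts[(y, x)] += 1

    n_cells = len(DINUCLEOTIDES) ** 2
    total = sum(pair_counts.values()) + pseudocount * n_cells
    # Smoothed joint frequency of each ordered dinucleotide pair.
    joint = {
        (x, y): (pair_counts[(x, y)] + pseudocount) / total
        for x in DINUCLEOTIDES
        for y in DINUCLEOTIDES
    }
    # Background frequency of each dinucleotide (marginal of the joint).
    background = {
        x: sum(joint[(x, y)] for y in DINUCLEOTIDES) for x in DINUCLEOTIDES
    }
    # Log-odds: observed pairing frequency versus chance pairing.
    return {
        (x, y): math.log2(joint[(x, y)] / (background[x] * background[y]))
        for x in DINUCLEOTIDES
        for y in DINUCLEOTIDES
    }


# Toy example with made-up sites; real input would be curated site alignments.
toy_sites = ["ACGT", "ACGT", "ACGA"]
toy_scores = dinucleotide_log_odds(toy_sites)
```

In this sketch, dinucleotide pairs that co-occur at aligned positions more often than chance receive positive scores, and pairs never observed together receive negative scores. Working with dinucleotides rather than single bases lets the scores reflect dependencies between neighbouring positions.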
