Comparative analysis of methods for representing and searching for transcription factor binding sites

MOTIVATION: An important step in unravelling the transcriptional regulatory network of an organism is to identify, for each transcription factor, all of its DNA binding sites. Several approaches are commonly used in searching for a transcription factor's binding sites, including consensus sequences and position-specific scoring matrices. In addition, methods that compute the average number of nucleotide matches between a putative site and all known sites can be employed. Such basic approaches can all be naturally extended by incorporating pairwise nucleotide dependencies and per-position information content. In this paper, we evaluate the effectiveness of these basic approaches and their extensions in finding binding sites for a transcription factor of interest without erroneously identifying other genomic sequences.

RESULTS: In cross-validation testing on a dataset of Escherichia coli transcription factors and their binding sites, we show that there are statistically significant differences in how well various methods identify transcription factor binding sites. The use of per-position information content improves the performance of all basic approaches. Furthermore, including local pairwise nucleotide dependencies within binding site models results in statistically significant performance improvements for approaches based on nucleotide matches. Based on our analysis, the best results when searching for DNA binding sites of a particular transcription factor are obtained by methods that incorporate both information content and local pairwise correlations.

AVAILABILITY: The software is available at http://compbio.cs.princeton.edu/bindsites.
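As a rough illustration of the basic approaches named in the abstract, the following Python sketch scores a candidate site with a position-specific scoring matrix and with the average number of nucleotide matches, optionally weighting each position by its information content. It is not the authors' software: the function names, the pseudocount of 1, the uniform background of 0.25, and the particular way information content is used as a per-position weight are all assumptions made for this example, and pairwise nucleotide dependencies are not modeled.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_counts(sites):
    """Count nucleotides in each column of equal-length, aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[i] for site in sites) for i in range(length)]


def column_probabilities(column, n_sites, pseudocount=1.0):
    """Smoothed nucleotide frequencies for one column (pseudocount is an assumption)."""
    total = n_sites + pseudocount * len(ALPHABET)
    return {base: (column[base] + pseudocount) / total for base in ALPHABET}


def information_content(sites, pseudocount=1.0):
    """Per-position information content in bits, assuming a uniform background."""
    content = []
    for column in position_counts(sites):
        probs = column_probabilities(column, len(sites), pseudocount)
        entropy = -sum(p * math.log2(p) for p in probs.values())
        content.append(math.log2(len(ALPHABET)) - entropy)
    return content


def build_pssm(sites, background=0.25, pseudocount=1.0):
    """Log-odds matrix: one dict of base -> score per position."""
    matrix = []
    for column in position_counts(sites):
        probs = column_probabilities(column, len(sites), pseudocount)
        matrix.append({base: math.log2(probs[base] / background) for base in ALPHABET})
    return matrix


def pssm_score(matrix, candidate):
    """Sum the log-odds scores of the candidate's nucleotides."""
    if len(candidate) != len(matrix):
        raise ValueError("candidate length must match the matrix")
    return sum(column[base] for column, base in zip(matrix, candidate))


def average_match_score(sites, candidate, weights=None):
    """Average (optionally weighted) number of matches to the known sites."""
    if weights is None:
        weights = [1.0] * len(candidate)  # unweighted: plain match counting
    total = 0.0
    for site in sites:
        total += sum(w for w, a, b in zip(weights, site, candidate) if a == b)
    return total / len(sites)


if __name__ == "__main__":
    known_sites = ["TTGACA", "TTGACA", "TTTACA", "TTGATA"]  # toy data, not from the paper
    weights = information_content(known_sites)
    matrix = build_pssm(known_sites)
    for candidate in ("TTGACA", "AAGCCA"):
        print(candidate,
              round(pssm_score(matrix, candidate), 3),
              round(average_match_score(known_sites, candidate), 3),
              round(average_match_score(known_sites, candidate, weights), 3))
```

In a genome scan, each of these scores would be computed for every window of the appropriate length and compared against a threshold; how that threshold is chosen is outside the scope of this sketch.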
