Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions

Background: Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding sites is considered.

Results: To improve this situation, we quantified the performance of several Position Weight Matrix (PWM) algorithms, using exhaustive approaches to find their optimal length and position. We applied these approaches to biomedically important TFBSs involved in the regulation of cell growth and proliferation as well as in inflammatory, immune, and antiviral responses (NF-κB, ISGF3, IRF1, STAT1), obesity and lipid metabolism (PPAR, SREBP, HNF4), and the regulation of steroidogenic (SF-1) and cell cycle (E2F) gene expression. We also gained extra specificity using a method, entitled SiteGA, which takes into account structural interactions within the TFBS core and flanking regions, using a genetic algorithm (GA) with a discriminant function of locally positioned dinucleotide (LPD) frequencies.

To ensure higher confidence in our approach, we applied resampling (jackknife and bootstrap) tests for the comparison; optimized PWMs and SiteGA showed similar recognition performance. We then applied SiteGA and optimized PWMs (both separately and together) to sequences in the Eukaryotic Promoter Database (EPD). The resulting SiteGA recognition models can now be used to search sequences for binding sites using the web tool, SiteGA.

Analysis of dependencies between close and distant LPDs revealed by the SiteGA models showed that the most significant correlations are between close LPDs and are generally located in the core (footprint) region. A greater number of less significant correlations are mainly between distant LPDs, which span both core and flanking regions.

When SiteGA and optimized PWM models were applied together, false positives were substantially reduced, at least at higher stringencies.

Conclusion: Based on this analysis, SiteGA adds substantial specificity even to optimized PWMs and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and EPD analysis has led to a list of genes which appear to be regulated by the above TFs.
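
As a rough illustration of the two kinds of features contrasted above, the following Python sketch builds a log-odds PWM from aligned sites, scans a sequence for the best-scoring window, and computes dinucleotide frequencies within local regions of a site. It is a minimal sketch under stated assumptions, not the SiteGA implementation: the function names, the pseudocount, the uniform background, and the region boundaries are all hypothetical, and the genetic algorithm that selects regions and the discriminant function that weights them are not reproduced here.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PWM from equal-length aligned sites (assumed uniform background)."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in BASES
        }
        pwm.append(column)
    return pwm


def pwm_score(pwm, window):
    """Sum of per-position log-odds; positions are treated as independent."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, offset) of the best-scoring window in seq."""
    n = len(pwm)
    return max((pwm_score(pwm, seq[i:i + n]), i) for i in range(len(seq) - n + 1))


def lpd_features(site, regions):
    """Dinucleotide frequencies inside local regions of a site.

    Each region is a half-open range (start, end) of dinucleotide start
    positions. The regions here are supplied by hand (hypothetical); in the
    abstract they are chosen by a genetic algorithm.
    """
    features = {}
    for start, end in regions:
        width = end - start
        counts = Counter(site[i:i + 2] for i in range(start, end))
        for dinucleotide, count in counts.items():
            features[(dinucleotide, start, end)] = count / width
    return features


if __name__ == "__main__":
    toy_sites = ["ACGT", "ACGT", "ACGA", "TCGT"]  # toy data, not real TFBSs
    pwm = build_pwm(toy_sites)
    print(best_hit(pwm, "TTTACGTTT"))
    print(lpd_features("ACGTAC", [(0, 5)]))
```

The PWM score adds independent per-position terms, whereas the dinucleotide features depend on adjacent base pairs within a chosen region, which is one way a model can use context beyond the core consensus.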
