Improved prediction of transcription binding sites from chromatin modification data

In this paper we apply machine learning to the task of predicting transcription factor binding sites by combining information on multiple forms of chromatin modification with the binding strength of the DNA site as predicted by a position weight matrix. We additionally explore the effect of incorporating auxiliary features such as the distance of the site to the nearest gene's transcription start site and the degree to which the site is conserved among related species. We approach the task as a classification problem, and show that both Naïve Bayes and Random Forests can provide substantial increases in the accuracy of predicted binding sites. Our results extend previous work that simply filtered candidate sites based on H3K4Me3 chromatin modification scores. In addition, we apply feature selection to explore which forms of chromatin modification and which auxiliary features have predictive value for which transcription factors.
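The classification setup described above can be illustrated with a minimal sketch. It assumes a Gaussian Naïve Bayes variant written in plain Python; the abstract does not say which variant or which feature encoding the authors used. The motif, the feature values, and the names `pwm_log_odds`, `fit_gaussian_nb`, `predict`, and `features` are all hypothetical and chosen only for illustration. Each candidate site is described by its position weight matrix log-odds score, one chromatin modification signal (H3K4Me3), its distance to the nearest transcription start site, and a conservation score.

```python
import math

def pwm_log_odds(site, pwm, background=0.25):
    """Sum the per-position log2 odds of a candidate site under the PWM."""
    return sum(math.log2(col[base] / background) for base, col in zip(site, pwm))

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-3  # small floor avoids zero variance
            for col, m in zip(zip(*members), means)
        ]
        model[cls] = (n / len(rows), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one feature vector."""
    def log_posterior(cls):
        prior, means, variances = model[cls]
        total = math.log(prior)
        for v, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Hypothetical PWM for a 3-bp motif (made-up probabilities, not from the paper).
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

def features(site, h3k4me3, tss_kb, conservation):
    """Combine the PWM score with chromatin and auxiliary features."""
    return [pwm_log_odds(site, PWM), h3k4me3, tss_kb, conservation]

# Made-up training examples: (site, H3K4Me3 signal, distance to TSS in kb,
# conservation score, label), where label 1 means "bound" and 0 "not bound".
TRAIN = [
    ("ACG", 8.0, 0.5, 0.9, 1),
    ("ACG", 7.0, 1.0, 0.8, 1),
    ("ACT", 6.5, 0.8, 0.7, 1),
    ("ACG", 1.0, 40.0, 0.2, 0),
    ("TCG", 0.5, 55.0, 0.1, 0),
    ("ATG", 1.5, 30.0, 0.3, 0),
]

X = [features(*example[:4]) for example in TRAIN]
y = [example[4] for example in TRAIN]
MODEL = fit_gaussian_nb(X, y)
```

In this toy data, a site with a perfect motif match is labelled bound only when the chromatin and auxiliary features also support it, so a call such as `predict(MODEL, features("ACG", 0.8, 45.0, 0.2))` depends on more than the PWM score alone. That is the intuition behind moving from a single-threshold filter to a classifier over several features.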
