Boosting Binding Sites Prediction Using Gene's Positions

Understanding transcriptional regulation requires reliable identification of the DNA binding sites recognized by each transcription factor (TF). Building an accurate bioinformatic model of TF-DNA binding is an essential step in differentiating true binding targets from spurious ones. Conventional approaches to binding site prediction are based on the notion of consensus sequences. They are formalized by so-called position-specific weight matrices (PWMs) and rely on the statistical analysis of the DNA sequences of known binding sites. To improve these techniques, we propose to use genome organization knowledge about the optimal positioning of co-regulated genes along the whole chromosome. For this purpose, we use machine learning approaches to optimally combine sequence information with positioning information. We present a new learning algorithm called PreCisIon, which relies on a TF binding classifier that optimally combines a set of PWMs and chromosomal position-based classifiers. This non-linear binding decision rule drastically reduces the rate of false positives, so that PreCisIon consistently outperforms sequence-based methods. This is shown by a cross-validation analysis in two model organisms: Escherichia coli and Bacillus subtilis. The analysis is based on the identification of binding sites for 24 TFs; PreCisIon achieved on average an AUC (area under the curve) of 70% and 60%, a sensitivity of 80% and 70%, and a specificity of 60% and 56% for B. subtilis and E. coli, respectively.
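To make the sequence-based baseline concrete, the following is a minimal sketch of how a PWM can be built from aligned known binding sites and used to score candidate sequences. It is not the PreCisIon algorithm itself: the function names (`build_pwm`, `score`, `best_hit`), the uniform background frequency of 0.25, the pseudocount of 1, and the toy sites are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from aligned sites.

    Assumes equal-length sites over the alphabet ACGT and a uniform
    background; both are simplifications chosen for this sketch.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        # Start every count at the pseudocount to avoid log(0).
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2((counts[b] / total) / background) for b in BASES}
        )
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds weights for a sequence of PWM width."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_hit(pwm, sequence):
    """Slide the PWM along a longer sequence; return (best score, offset)."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical aligned sites, invented purely for illustration.
sites = ["TTGACA", "TTGACA", "TTGACT", "TTTACA"]
pwm = build_pwm(sites)
hit_score, hit_offset = best_hit(pwm, "GGGTTGACAGGG")
```

A score like this would be only one input to a combined decision rule; the positional classifiers and the way the abstract's method weights them against the PWMs are not specified here and are left out of the sketch.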
