An exploration into improving DNA motif inference by looking for highly conserved core regions

Although most verified functional elements in noncoding DNA contain a highly conserved core region, de novo motif inference systems do not generally make use of this property. In this work, we explore the utility of incorporating conserved core regions into a comparative genomics approach to finding putative functional elements in noncoding DNA. By modifying the scoring function of GAMI (Genetic Algorithms for Motif Inference), we investigate tradeoffs between the strength of conservation of the full motif and the strength of conservation of a core region. The results illustrate that incorporating information about the structure of transcription factor binding sites can help identify biologically functional elements.
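To make the tradeoff concrete, the following is a minimal sketch of a score that blends full-motif conservation with the conservation of the best contiguous core window. It is not GAMI's actual scoring function, which the abstract does not specify. The use of per-column information content as the conservation measure, the function names, and the parameters `core_len` and `core_weight` are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def column_information(column):
    """Information content in bits of one alignment column over A, C, G, T.

    Assumes a uniform background and applies no small-sample correction.
    """
    total = len(column)
    entropy = -sum((n / total) * log2(n / total) for n in Counter(column).values())
    return 2.0 - entropy


def core_weighted_score(instances, core_len=4, core_weight=0.5):
    """Blend mean conservation of the full motif with that of its best core.

    instances   -- equal-length motif instances, one per species (hypothetical input)
    core_len    -- width of the contiguous core window (assumed parameter)
    core_weight -- 0.0 scores the full motif only, 1.0 scores the core only
    """
    columns = list(zip(*instances))
    if not 0 < core_len <= len(columns):
        raise ValueError("core_len must be between 1 and the motif length")
    info = [column_information(col) for col in columns]
    full = sum(info) / len(info)
    # Slide a window across the motif and keep the most conserved one.
    core = max(
        sum(info[i:i + core_len]) / core_len
        for i in range(len(info) - core_len + 1)
    )
    return (1.0 - core_weight) * full + core_weight * core


if __name__ == "__main__":
    motif = ["ACGTAC", "ACGTTC", "ACGTGG"]
    for w in (0.0, 0.5, 1.0):
        print(w, round(core_weighted_score(motif, core_len=4, core_weight=w), 3))
```

Raising `core_weight` favors candidates with a sharply conserved core even when the flanking positions vary, while lowering it favors uniform conservation across the whole motif.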
