An Evaluation of Information Content as a Metric for the Inference of Putative Conserved Noncoding Regions in DNA Sequences Using a Genetic Algorithms Approach

In previous work, we presented GAMI [1], an approach to motif inference that uses a genetic algorithms search. GAMI is designed specifically to find putative conserved regulatory motifs in noncoding regions of divergent species and to support the analysis of long nucleotide sequences. In this work, we compare GAMI's performance when run with its original fitness function (a simple count of the number of matches), with information content (IC), and with several variations on these metrics. Results indicate that IC does not identify highly conserved regions and is thus not an appropriate metric for this task, whereas variations on IC, as well as the original metric, succeed in identifying putative conserved regions.
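
To make the two kinds of metric concrete, the following is a minimal sketch and not GAMI's actual implementation. It assumes that a match count is the number of agreeing positions between a candidate motif and a same-length window, and that IC is the commonly used per-column relative entropy against a uniform background of 0.25 per base, summed over columns. The function names (`match_count`, `best_match_count`, `information_content`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def match_count(motif, window):
    """Count positions at which the motif and a same-length window agree."""
    return sum(1 for a, b in zip(motif, window) if a == b)


def best_match_count(motif, sequence):
    """Best match count of the motif over all windows of one sequence.

    Assumes len(sequence) >= len(motif); otherwise there are no windows.
    """
    n = len(motif)
    return max(
        match_count(motif, sequence[i:i + n])
        for i in range(len(sequence) - n + 1)
    )


def information_content(windows, background=0.25):
    """Sum over columns of sum_b p_b * log2(p_b / background).

    `windows` is a list of equal-length aligned strings, one per sequence.
    The uniform background is an assumption of this sketch.
    """
    total = 0.0
    n = len(windows)
    for column in zip(*windows):
        for count in Counter(column).values():
            p = count / n
            total += p * math.log2(p / background)
    return total


if __name__ == "__main__":
    print(best_match_count("ACG", "TTACGTT"))
    print(information_content(["ACGT", "ACGT"]))
```

Under these assumptions, IC depends only on the column-wise base frequencies of the aligned windows, whereas the match count compares each window directly with the candidate motif.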

[1] R. Sibly, et al. Discovering patterns in microsatellite flanks with evolutionary computation by evolving discriminatory DNA motifs, 2002, Proceedings of the 2002 Congress on Evolutionary Computation. CEC'02 (Cat. No.02TH8600).

[2] Jenny R. Roberts, et al. Accelerated Ovarian Failure Induced by 4-Vinyl Cyclohexene Diepoxide in Nrf2 Null Mice, 2006, Molecular and Cellular Biology.

[3] Charles Elkan, et al. Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer, 1994, ISMB.

[4] N. Benvenisty, et al. Involvement of hepatocyte nuclear factor 3 in endoderm differentiation of embryonic stem cells, 1997, Molecular and cellular biology.

[5] Axel Meyer, et al. Evolutionary conservation of regulatory elements in vertebrate Hox gene clusters, 2003, Genome research.

[6] G. Fogel, et al. Discovery of sequence motifs related to coexpression of genes using evolutionary computation, 2004, Nucleic acids research.

[7] Nancy F. Hansen, et al. Comparative analyses of multi-species sequences from targeted genomic regions, 2003, Nature.

[8] N. Mouchel, et al. HNF1alpha is involved in tissue-specific regulation of CFTR gene expression, 2004, The Biochemical journal.

[9] W. Pearson. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms, 1991, Genomics.

[10] T. Andrews, et al. The Ensembl automatic gene annotation system, 2004, Genome research.

[11] William Stafford Noble, et al. Assessing computational tools for the discovery of transcription factor binding sites, 2005, Nature Biotechnology.

[12] J. Touchman, et al. Vertebrate genome sequencing: building a backbone for comparative genomics, 2002, Trends in genetics : TIG.

[13] David E. Goldberg, et al. Genetic Algorithms in Search Optimization and Machine Learning, 1988.

[14] A. Rzhetsky, et al. The human ATP-binding cassette (ABC) transporter superfamily, 2001, Genome research.

[15] Robert Edwards, et al. Glutathione Transferases, 2010, The arabidopsis book.

[16] Alexander E. Kel, et al. TRANSFAC®: transcriptional regulation, from patterns to profiles, 2003, Nucleic Acids Res.

[17] Gary D. Stormo, et al. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, 1999, Bioinform.

[18] R. Gibbs, et al. PipMaker--a web server for aligning two genomic DNA sequences, 2000, Genome research.

[19] L. Pennacchio, et al. Comparative genomic tools and databases: providing insights into the human genome, 2003, The Journal of clinical investigation.

[20] G. Crooks, et al. WebLogo: a sequence logo generator, 2004, Genome Research.

[21] M S Waterman, et al. Identification of common molecular subsequences, 1981, Journal of molecular biology.

[22] S. Brenner, et al. Detecting conserved regulatory elements with the model genome of the Japanese puffer fish, Fugu rubripes, 1995, Proceedings of the National Academy of Sciences of the United States of America.

[23] David Corne, et al. Evolving core promoter signal motifs, 2001, Proceedings of the 2001 Congress on Evolutionary Computation (IEEE Cat. No.01TH8546).

[24] Lawrence Davis, et al. Handbook Of Genetic Algorithms, 1990.

[25] D. Townsend, et al. Glutathione S-transferase polymorphisms: cancer incidence and therapy, 2006, Oncogene.

[26] S. Cole, et al. Toxicological relevance of the multidrug resistance protein 1, MRP1 (ABCC1) and related transporters, 2001, Toxicology.

[27] Klaudia Walter, et al. Highly Conserved Non-Coding Sequences Are Associated with Vertebrate Development, 2004, PLoS biology.

[28] Geoffrey J. Barton, et al. The Jalview Java alignment editor, 2004, Bioinform.

[29] G. Crooks, et al. WebLogo: a sequence logo generator, 2004, Genome research.

[30] Tatiana A. Tatusova, et al. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, 2004, Nucleic Acids Res.

[31] Thomas Werner, et al. MatInspector and beyond: promoter analysis based on transcription factor binding sites, 2005, Bioinform.

[32] Chuong B. Do, et al. Access the most recent version at doi: 10.1101/gr.926603 References, 2003.

[33] Dorothea Heiss-Czedik, et al. An Introduction to Genetic Algorithms, 1997, Artificial Life.

[34] Andrew M. Tyrrell, et al. The evolutionary computation approach to motif discovery in biological sequences, 2005, GECCO '05.

[35] J. Thompson, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, 1994, Nucleic acids research.

[36] Carolyn J. Mattingly, et al. Preliminary Results for GAMI: A Genetic Algorithms Approach to Motif Inference, 2005, 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology.

[37] W. Miller, et al. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons, 2000, Science.

[38] N. Plant, et al. Role of Sp1, C/EBP alpha, HNF3, and PXR in the basal- and xenobiotic-mediated regulation of the CYP3A4 gene, 2004, Drug metabolism and disposition: the biological fate of chemicals.

[39] Tatiana Tatusova, et al. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, 2004, Nucleic Acids Res.

[40] Stephen H. Bryant, et al. CD-Search: protein domain annotations on the fly, 2004, Nucleic Acids Res.

[41] G. Church, et al. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae, 2000, Journal of molecular biology.

[42] Eric C. Rouchka, et al. Gibbs Recursive Sampler: finding transcription factor binding sites, 2003, Nucleic Acids Res.

[43] I-Min A. Dubchak, et al. Active conservation of noncoding sequences revealed by three-way species comparisons, 2000, Genome research.

[44] C. Higgins, et al. ABC transporters: from microorganisms to man, 1992, Annual review of cell biology.