Entropic Profiler – detection of conservation in genomes using information theory

Background: In recent decades, as whole genome sequences have become successively available, many research efforts have been made to model DNA mathematically. Entropic Profiles (EP) were recently proposed as a new measure of the continuous entropy of genome sequences. EP are local information plots related to DNA randomness, based on concepts from information theory and statistics. They express the weighted relative abundance of motifs at each position in a genome. Their study is relevant because under- or over-represented segments are often associated with significant biological meaning.

Findings: The Entropic Profiler application presented here is a new tool designed to detect and extract under- and over-represented DNA segments in genomes by using EP. It computes EP efficiently by means of improved algorithms and data structures, which include modified suffix trees. Available through a web interface at http://kdbio.inesc-id.pt/software/ep/ and as downloadable source code, it allows users to study positions and to search for motifs in the whole sequence or within a specified range. DNA sequences can be entered from different sources, including FASTA files, pre-loaded examples, or previously saved work. Besides the EP value plots, p-values and z-scores are computed for each motif, along with the Chaos Game Representation of the sequence.

Conclusion: EP are directly related to the statistical significance of motifs and can be considered a new method to extract and classify significant regions in genomes and to estimate local scales in DNA. The present implementation provides an efficient and useful tool for whole genome analysis.
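The Chaos Game Representation mentioned among the tool's outputs can be illustrated with a minimal sketch. It is not taken from the Entropic Profiler source code: the function name `cgr_points` and the assignment of nucleotides to corners of the unit square are assumptions made for illustration, and other corner conventions exist. Starting from the centre of the square, each nucleotide moves the current point halfway towards its corner, so that every position in the sequence maps to one point.

```python
from typing import Dict, List, Tuple

# Assumed corner assignment on the unit square (hypothetical choice;
# the tool itself may use a different convention).
CORNERS: Dict[str, Tuple[float, float]] = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq: str) -> List[Tuple[float, float]]:
    """Return one Chaos Game Representation point per valid nucleotide."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points: List[Tuple[float, float]] = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip symbols such as N; the walk is not reset
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for base, point in zip("ACGT", cgr_points("ACGT")):
        print(base, point)
```

Because each step halves the distance to a corner, sequences that share a suffix of length L end up in the same sub-square of side 2^-L, which is why local point density in this map reflects how often a motif occurs.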
