A Relative-Entropy Algorithm for Genomic Fingerprinting Captures Host-Phage Similarities

ABSTRACT The degeneracy of codons allows a multitude of possible sequences to code for the same protein. Hidden within the particular choice of sequence for each organism are more than 100 previously undiscovered, biologically significant short oligonucleotides (length, 2 to 7 nucleotides). We present an information-theoretic algorithm that finds these novel signals. Applying this algorithm to the 209 sequenced bacterial genomes in the NCBI database, we determine for each bacterium a set of oligonucleotides that uniquely characterizes the organism. Some of these signals have known biological functions, such as restriction enzyme binding sites, but most are new. We also introduce an accompanying scoring algorithm that assigns 100-kb sequences to their correct species, chosen from among hundreds of candidates, with 92% accuracy. This algorithm also does far better than previous methods at relating phage genomes to their bacterial hosts, suggesting that the lists of oligonucleotides are “genomic fingerprints” that encode information about the effects of the cellular environment on DNA sequence. Our approach provides a novel basis for phylogeny and may be well suited to classifying the short DNA fragments obtained by environmental shotgun sequencing. The methods developed here can be readily extended to other problems in bioinformatics.
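
As a rough illustration of what a relative-entropy comparison of short oligonucleotides can look like, the Python sketch below computes k-mer frequencies for two sequences and the per-oligonucleotide terms of the Kullback-Leibler divergence between them. It is not the algorithm of this paper: the abstract does not specify the background model or the scoring rule, so the plain sequence-versus-sequence comparison, the pseudocount, and all function names here (`kmer_freqs`, `relative_entropy_terms`, `relative_entropy`) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"


def kmer_freqs(seq, k, pseudocount=1.0):
    """Return smoothed frequencies of all 4**k k-mers in seq.

    The pseudocount is an assumption of this sketch; it keeps every
    frequency nonzero so that the log ratios below are always defined.
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set(ALPHABET)  # skip windows with N, etc.
    )
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts.values()) + pseudocount * len(all_kmers)
    return {w: (counts[w] + pseudocount) / total for w in all_kmers}


def relative_entropy_terms(p, q):
    """Per-k-mer contributions p(w) * log2(p(w) / q(w)) to D(p || q)."""
    return {w: p[w] * log2(p[w] / q[w]) for w in p}


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(relative_entropy_terms(p, q).values())


if __name__ == "__main__":
    # Toy sequences, not real genomes.
    observed = kmer_freqs("AAAAAAAAAA", 2)
    background = kmer_freqs("ACGTACGTAC", 2)
    terms = relative_entropy_terms(observed, background)
    # The k-mers with the largest terms are the most over-represented
    # in the observed sequence relative to the background.
    for w in sorted(terms, key=terms.get, reverse=True)[:3]:
        print(w, round(terms[w], 3))
    print("D(observed || background) =", round(relative_entropy(observed, background), 3))
```

In a sketch like this, the oligonucleotides with the largest positive terms are the ones most enriched relative to the chosen background, which is one plausible way to read a list of characteristic signals. How the paper constructs its background and turns such signals into a species score is not stated in the abstract.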
