Reliability and applications of statistical methods based on oligonucleotide frequencies in bacterial and archaeal genomes

Background: The increasing number of sequenced prokaryotic genomes contains a wealth of genomic data that needs to be effectively analysed. A set of statistical tools exists for such analysis, but their strengths and weaknesses have not been fully explored. The statistical methods we are concerned with here are mainly used to examine similarities between archaeal and bacterial DNA from different genomes. These methods compare observed genomic frequencies of fixed-size oligonucleotides with expected values, which can be determined from genomic nucleotide content, from smaller oligonucleotide frequencies, or from specific statistical distributions. Advantages of these statistical methods include measurement of phylogenetic relationships from relatively small pieces of DNA sampled from almost anywhere within genomes, detection of foreign/conserved DNA, and homology searches. Our aim was to explore the reliability and best-suited applications of some popular methods, which include relative oligonucleotide frequencies (ROF), di- to hexanucleotide zeroth-order Markov methods (ZOM) and the second-order Markov chain method (MCM). Tests were performed on distant homology searches with large DNA sequences, detection of foreign/conserved DNA, and plasmid-host similarity comparisons. Additionally, the reliability of the methods was tested by comparing both real and random genomic DNA.

Results: Our findings show that the optimal method is context dependent. ROFs were best suited for distant homology searches, whilst the hexanucleotide ZOM and MCM measures were more reliable in terms of phylogeny. The dinucleotide ZOM method produced high correlation values when used to compare real genomes to an artificially constructed random genome with similar %GC, and should therefore be used with care. The tetranucleotide ZOM measure was a good measure for detecting horizontally transferred regions, and when it was used to compare the phylogenetic relationships between plasmids and hosts, significant correlation (R² = 0.4) was found with genomic GC content and intra-chromosomal homogeneity.

Conclusion: The statistical methods examined are fast, easy to implement, and powerful for a number of different applications involving genomic sequence comparisons. However, none of the measures examined was superior in all tests, and therefore the choice of statistical method should depend on the task at hand.
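To make the observed-versus-expected idea concrete, the following is a minimal sketch of a zeroth-order Markov comparison, not the authors' implementation. It assumes that the expected count of each oligonucleotide is the product of its single-nucleotide frequencies scaled by the number of windows, and that two sequences are compared by the Pearson correlation of their observed/expected profiles. The function names `zom_profile` and `pearson` are hypothetical, and the abstract does not specify the exact formulas used.

```python
from collections import Counter
from itertools import product
from math import sqrt

BASES = "ACGT"


def zom_profile(seq, k=4):
    """Observed/expected ratio for every k-mer under a zeroth-order Markov model.

    Assumption: expected count = (number of windows) * product of the
    single-nucleotide frequencies of the bases in the k-mer.
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    base_freq = {b: seq.count(b) / len(seq) for b in BASES}
    observed = Counter(seq[i:i + k] for i in range(n_windows))

    profile = []
    for word in map("".join, product(BASES, repeat=k)):
        expected = float(n_windows)
        for b in word:
            expected *= base_freq[b]
        # A k-mer containing a base absent from the sequence has expected == 0.
        profile.append(observed[word] / expected if expected > 0 else 0.0)
    return profile


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy sequences for illustration only; real use would read genome FASTA files.
    genome_a = "ACGTTGCAAGCTAGGCTA" * 20
    genome_b = "TTGACCGATAGCATCGGA" * 20
    r = pearson(zom_profile(genome_a, k=2), zom_profile(genome_b, k=2))
    print(f"dinucleotide ZOM correlation: {r:.3f}")
```

Varying `k` from 2 to 6 corresponds to the di- to hexanucleotide range named above. A second-order Markov variant would instead derive expected values from shorter oligonucleotide frequencies, which this sketch does not attempt.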
