Kullback–Leibler divergence in complete bacterial and phage genomes

The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback–Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition of a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) amino acid utilization differs significantly among phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; (iv) amino acid utilization profiles correlate strongly with GC content in bacterial genomes but only very weakly with GC content in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, and so help prioritize subsequent analyses.
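As a rough illustration of the metric named above, the following sketch computes each genome's amino acid frequency profile and its Kullback–Leibler divergence from the mean profile, D(P || Q) = sum over amino acids of p * log2(p / q). This is not the authors' code. The function names, the toy sequences, the unweighted mean across genomes, and the base-2 logarithm are all assumptions made for the example.

```python
import math
from collections import Counter

# The 20 standard amino acids; other symbols (X, *, gaps) are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(proteins):
    """Return the relative frequency of each amino acid over all proteins of one genome."""
    counts = Counter()
    for seq in proteins:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def mean_profile(profiles):
    """Unweighted mean of several frequency profiles (an assumption of this sketch)."""
    n = len(profiles)
    return {aa: sum(p[aa] for p in profiles) / n for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Terms with p[aa] == 0 contribute nothing. q[aa] is nonzero wherever
    p[aa] is nonzero as long as q is a mean that includes p.
    """
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS if p[aa] > 0)


# Hypothetical toy "genomes", each given as a list of protein sequences.
genomes = {
    "balanced": ["ACDEFGHIKLMNPQRSTVWY"],
    "skewed": ["KKKKNNNNIIIIKKKKNNNN"],
}

profiles = {name: amino_acid_frequencies(seqs) for name, seqs in genomes.items()}
mean = mean_profile(list(profiles.values()))
divergences = {name: kl_divergence(p, mean) for name, p in profiles.items()}

for name, d in sorted(divergences.items(), key=lambda kv: kv[1]):
    print(f"{name}: {d:.3f} bits")
```

A genome whose proteins use amino acids in proportions close to the mean gets a divergence near zero, and a strongly skewed profile gets a larger value. With real data the mean would be taken over many genomes, and the weighting of that mean (per genome or per residue) is a design choice this sketch does not settle.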
