Testing the infinitely many genes model for the evolution of the bacterial core genome and pangenome.

When groups of related bacterial genomes are compared, the number of core genes found in all genomes is usually much smaller than the mean genome size, whereas the pangenome (the set of genes found in at least one of the genomes) is much larger than the mean size of one genome. We analyze 172 complete genomes of Bacilli and compare the properties of the pangenomes and core genomes of monophyletic subsets taken from this group. We then assess how well several evolutionary models predict these properties. The infinitely many genes (IMG) model is based on the assumption that each new gene can arise only once. The predictions of the model depend on the shape of the evolutionary tree that underlies the divergence of the genomes. We calculate results for coalescent trees, star trees, and arbitrary phylogenetic trees with predefined, fixed branch lengths. On a star tree, the pangenome size increases linearly with the number of genomes, as has been suggested in some previous studies, whereas on a coalescent tree it increases logarithmically. The coalescent tree gives a better fit to the data for all the examples we consider. In some cases, a fixed phylogenetic tree proved better than the coalescent tree at reproducing structure in the gene frequency spectrum, but it gave little improvement in predictions of the core and pangenome sizes. Most of the data are well explained by a model with three classes of genes: an essential class that is found in all genomes, a slow class whose rate of origination and deletion is slow compared with the time of divergence of the genomes, and a fast class showing rapid origination and deletion. Although the majority of genes originating in a genome are in the fast class, these genes are not retained for long periods, and the majority of genes present in a genome are in the slow or essential classes.
In general, we show that the IMG model is useful for comparison with experimental genome data, both for species-level groups and for widely divergent taxonomic groups. Software implementing the described formulae is provided at http://github.com/rec3141/pangenome.
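
The contrast between linear growth on a star tree and logarithmic growth on a coalescent tree can be illustrated with a minimal sketch. It is not the software linked above, and it covers a single gene class only. It assumes the scaling commonly used in the IMG literature: each lineage gains new genes at rate theta/2 and loses each gene at rate rho/2, so a genome carries theta/rho genes on average, and time on the coalescent tree is measured in units where a pair of lineages coalesces at rate 1. The coalescent expression is the standard expected pangenome size under that scaling; the star-tree expression follows from counting root genes that survive on at least one branch plus genes gained independently on each branch. The function names, the branch length `t`, and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def pangenome_coalescent(n, theta, rho):
    """Expected pangenome size for n genomes on a coalescent tree.

    Assumes per-lineage gain rate theta/2 and loss rate rho/2, with
    pairwise coalescence rate 1 (an assumed scaling, see lead-in).
    """
    return theta * sum(1.0 / (rho + k) for k in range(n))


def pangenome_star(n, theta, rho, t):
    """Expected pangenome size for n genomes on a star tree.

    Every genome descends from the root along its own branch of
    length t (hypothetical parameter), with the same gain and loss
    rates as in pangenome_coalescent.
    """
    per_genome = theta / rho             # stationary mean genome size
    p_survive = math.exp(-rho * t / 2)   # a root gene survives one branch
    # Root genes still present in at least one of the n genomes.
    from_root = per_genome * (1.0 - (1.0 - p_survive) ** n)
    # Genes gained on a branch and retained; unique to that genome.
    from_branches = n * per_genome * (1.0 - p_survive)
    return from_root + from_branches


if __name__ == "__main__":
    theta, rho, t = 1000.0, 2.0, 1.0     # illustrative values, not fitted
    for n in (1, 2, 5, 10, 50, 100):
        g_coal = pangenome_coalescent(n, theta, rho)
        g_star = pangenome_star(n, theta, rho, t)
        print(f"n={n:4d}  coalescent={g_coal:10.1f}  star={g_star:10.1f}")
```

Under these assumptions, each additional genome on the star tree eventually adds a constant number of new genes, (theta/rho)(1 - exp(-rho t / 2)), whereas on the coalescent tree the n-th genome adds only theta/(rho + n - 1) genes, which gives the logarithmic growth described above.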
