Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes

Background: Escherichia coli exists in commensal and pathogenic forms. By measuring the variation of individual genes across more than a hundred sequenced genomes, gene variation can be studied in detail, including the number of mutations found for any given gene. This knowledge will be useful for creating better phylogenies, for determining molecular clocks and for improving typing techniques.

Results: We find 3,051 gene clusters/families present in at least 95% of the genomes and 1,702 gene clusters present in 100% of the genomes. The former 'soft core' of about 3,000 gene families is perhaps more biologically relevant, especially considering that many of these genome sequences are draft quality. The E. coli pan-genome for this set of isolates contains 16,373 gene clusters. A core-gene tree, based on alignment, and a pan-genome tree, based on gene presence/absence, map the relatedness of the 186 sequenced E. coli genomes. The core-gene tree displays high confidence, divides the E. coli strains into the observed MLST type clades, and also separates defined phylotypes.

Conclusion: The results of comparing a large and diverse E. coli dataset support the theory that reliable, well-resolved phylogenies can be inferred from the core genome. The results further suggest that resolution at the isolate level may subsequently be improved by targeting more variable genes. The use of whole-genome sequencing will make it possible to eliminate, or at least reduce, the need for several typing steps used in traditional epidemiology.
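The distinction between the strict core (present in 100% of genomes), the 'soft core' (present in at least 95%) and the pan-genome can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the dictionary layout and the toy data are assumptions made for illustration, and the presence/absence distance shown is a simple count of differing clusters rather than the measure used for the published pan-genome tree.

```python
from typing import Dict, List, Set


def classify_clusters(
    presence: Dict[str, Set[str]],
    n_genomes: int,
    soft_threshold: float = 0.95,
) -> Dict[str, List[str]]:
    """Split gene clusters into pan-genome, soft core and strict core.

    `presence` maps a (hypothetical) cluster id to the set of genome ids
    in which at least one member of that cluster was found.
    """
    soft_core = []
    strict_core = []
    for cluster, genomes in presence.items():
        if len(genomes) / n_genomes >= soft_threshold:
            soft_core.append(cluster)
        if len(genomes) == n_genomes:
            strict_core.append(cluster)
    return {
        "pan": sorted(presence),
        "soft_core": sorted(soft_core),
        "strict_core": sorted(strict_core),
    }


def presence_absence_distance(
    presence: Dict[str, Set[str]], genome_a: str, genome_b: str
) -> int:
    """Count clusters found in exactly one of the two genomes."""
    return sum(
        (genome_a in genomes) != (genome_b in genomes)
        for genomes in presence.values()
    )


# Toy data (illustrative only): 20 genomes and 3 clusters.
all_genomes = [f"g{i}" for i in range(20)]
toy_presence = {
    "clusterA": set(all_genomes),       # in every genome
    "clusterB": set(all_genomes[:19]),  # in 19 of 20 genomes (95%)
    "clusterC": set(all_genomes[:5]),   # accessory, in 5 genomes
}

groups = classify_clusters(toy_presence, n_genomes=len(all_genomes))
print(len(groups["pan"]), len(groups["soft_core"]), len(groups["strict_core"]))
```

A pairwise matrix of such distances could then be passed to any standard distance-based tree-building method; a soft-core threshold below 100% tolerates genes that are missing only because an assembly is of draft quality.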
