High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries

A fundamental question in microbiology is whether genetic diversity among genomes forms a continuum or whether clear species boundaries prevail instead. Answering this question requires robust measurement of whole-genome relatedness among thousands of genomes from diverse phylogenetic lineages. Whole-genome similarity metrics such as Average Nucleotide Identity (ANI) can provide the resolution needed for this task, overcoming several limitations of the traditional techniques used for the same purpose. Although the number of genomes currently available may be adequate, the associated bioinformatics tools lag behind and cannot scale to large datasets. Here, we present a new method, FastANI, to compute ANI using alignment-free approximate sequence mapping. Our analyses demonstrate that FastANI produces an accurate ANI estimate and is up to three orders of magnitude faster than an alignment-based (e.g., BLAST-based) approach. We leverage FastANI to compute pairwise ANI values among all prokaryotic genomes available in the NCBI database. Our results reveal a clear genetic discontinuity among the database genomes, with 99.8% of the total 8 billion genome pairs analyzed showing either >95% intra-species ANI or <83% inter-species ANI values. We further show that this discontinuity is recovered with or without the most frequently represented species in the database and is robust to historic additions in the public genome databases. Therefore, 95% ANI represents an accurate threshold for demarcating almost all currently named prokaryotic species, and wide species boundaries may exist for prokaryotes.
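
To make the reported discontinuity concrete, the following is a minimal sketch, not the authors' analysis pipeline, of how pairwise ANI values could be binned against the two thresholds quoted above (>95% and <83%). The function names `classify_ani` and `discontinuity_fraction` and the example ANI values are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Thresholds taken from the abstract; everything else here is an assumption.
INTRA_SPECIES_MIN = 95.0  # ANI above this: pair treated as same species
INTER_SPECIES_MAX = 83.0  # ANI below this: pair treated as different species


def classify_ani(ani):
    """Assign one pairwise ANI value (in percent) to a bin."""
    if ani > INTRA_SPECIES_MIN:
        return "intra-species"
    if ani < INTER_SPECIES_MAX:
        return "inter-species"
    return "intermediate"  # falls inside the 83-95% gap


def discontinuity_fraction(ani_values):
    """Fraction of pairs that fall outside the 83-95% intermediate zone."""
    counts = Counter(classify_ani(v) for v in ani_values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ANI values supplied")
    return (counts["intra-species"] + counts["inter-species"]) / total


# Made-up ANI values for four hypothetical genome pairs.
example = [98.7, 80.0, 90.0, 76.5]
print(discontinuity_fraction(example))
```

A fraction close to 1 would indicate that few pairs fall in the intermediate zone; the abstract reports 99.8% for the NCBI genome pairs analyzed.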
