Assembly-free and alignment-free sample identification using genome skims

The ability to quickly and inexpensively describe taxonomic diversity is critical in this era of rapid climate and biodiversity changes. The currently preferred molecular technique, barcoding, has been very successful but relies on short organelle markers. Recently, an alternative genome-skimming approach has been proposed: low-pass sequencing (100 Mb to several Gb per sample) is applied to voucher and/or query samples, and marker genes and/or organelle genomes are recovered computationally. Current genome-skimming practice discards the vast majority of the data because the low coverage of genome skims prevents assembly of the nuclear genomes. In contrast, we suggest using all unassembled reads directly, a goal that existing methods support poorly. We introduce Skmer, a new alignment-free tool that estimates genomic distances between the query and each reference genome skim using the k-mer decomposition of reads. We test Skmer on a large set of insect and bird genomes, sub-sampled to create genome skims. Skmer shows high accuracy in estimating genomic distances, identifying the closest match in a reference dataset, and inferring the phylogeny. The software is publicly available (https://github.com/shahab-sarmashghi/Skmer.git).
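To make the idea of a k-mer-based distance concrete, the following is a minimal Python sketch, not Skmer's implementation. It decomposes reads into canonical k-mers, computes the Jaccard index of the two k-mer sets, and converts it to a distance with the Mash-style transform D = -(1/k) ln(2J / (1 + J)). The function names are hypothetical, and the sketch assumes error-free reads at high coverage; it omits the adjustments a low-coverage genome skim would need, which the abstract does not detail.

```python
import math

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers in one sequence.

    A canonical k-mer is the lexicographically smaller of a k-mer and
    its reverse complement, so both strands map to the same key.
    K-mers containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            rev_comp = kmer.translate(_COMPLEMENT)[::-1]
            kmers.add(min(kmer, rev_comp))
    return kmers


def kmers_from_reads(reads, k):
    """Union of canonical k-mers over all reads (no assembly needed)."""
    kmers = set()
    for read in reads:
        kmers |= canonical_kmers(read, k)
    return kmers


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_style_distance(j, k):
    """Mash-style distance from a Jaccard index j and k-mer length k.

    Illustrative assumption: no correction for coverage or sequencing error.
    """
    if j <= 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    # Toy reads; real genome skims contain millions of reads.
    query = kmers_from_reads(["ACGTTGCAAGGCTA", "GGCTAACCGT"], k=5)
    reference = kmers_from_reads(["ACGTTGCAAGGCTT", "GGCTTACCGT"], k=5)
    print(mash_style_distance(jaccard(query, reference), k=5))
```

On real skims, the Jaccard index computed this way also depends on sequencing depth and error rate, so this plain transform would tend to overestimate distances at low coverage.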
