in silico Whole Genome Sequencer & Analyzer (iWGS): a computational pipeline to guide the design and analysis of de novo genome sequencing studies

The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. At the same time, the increasing diversity of sequencing technologies, assays, and de novo assembly algorithms has added to the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges of such projects and to streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategies and assembly protocols. iWGS integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS gives the user great flexibility in testing the range of experimental designs available for genome sequencing projects, and it supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.
