Evaluation and validation of de novo and hybrid assembly techniques to derive high-quality genome sequences

Motivation: To assess the potential of different types of sequence data, combined with de novo and hybrid assembly approaches, to improve existing draft genome sequences.

Results: Illumina, 454 and PacBio sequencing technologies were used to generate de novo and hybrid genome assemblies for four different bacteria. The assemblies were assessed for quality using summary statistics (e.g. number of contigs, N50) and in silico evaluation tools. Differences among assemblies in the predicted copy number of rDNA operons for each bacterium were evaluated by PCR and Sanger sequencing, and the validated results were then applied as an additional criterion to rank assemblies. In general, assemblies using longer PacBio reads were better able to resolve repetitive regions. In this study, the combination of Illumina and PacBio sequence data assembled with the ALLPATHS-LG algorithm gave the best summary statistics and the most accurate rDNA operon number predictions. This study will aid others looking to improve existing draft genome assemblies.

Availability and implementation: All assembly tools except CLC Genomics Workbench are freely available under the GNU General Public License.

Contact: brownsd@ornl.gov

Supplementary information: Supplementary data are available at Bioinformatics online.
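The summary statistics named in the Results (number of contigs, N50) can be illustrated with a minimal sketch. It assumes the standard definition of N50: the length of the contig at which the cumulative length of contigs, sorted longest first, reaches at least half of the total assembly size. The function names `n50` and `summarize` and the example contig lengths are hypothetical and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths (in bp).

    Assumes the standard definition: sort contigs from longest to shortest
    and return the length of the contig at which the running total first
    reaches at least half of the total assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def summarize(contig_lengths):
    """Collect basic summary statistics for one assembly (hypothetical helper)."""
    return {
        "contigs": len(contig_lengths),
        "total_bp": sum(contig_lengths),
        "largest_bp": max(contig_lengths),
        "n50_bp": n50(contig_lengths),
    }


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    example = [80, 70, 50, 40, 30, 20]
    print(summarize(example))
    # {'contigs': 6, 'total_bp': 290, 'largest_bp': 80, 'n50_bp': 70}
```

Fewer contigs and a larger N50 generally indicate a more contiguous assembly, but neither statistic measures correctness, which is why the study pairs them with in silico evaluation tools and experimental validation of rDNA operon counts.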
