SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

BackgroundThere is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads; however, several big challenges remain to be overcome in order for this to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions.FindingsTo overcome these challenges, we have developed its successor, SOAPdenovo2, which has the advantage of a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and optimizes for large genome.ConclusionsBenchmark using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive to other assemblers on both assembly length and accuracy. We also provide an updated assembly version of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which is 3-fold and 50-fold longer than the first published version. The genome coverage increased from 81.16% to 93.91%, and memory consumption was ~2/3 lower during the point of largest memory consumption.

[1]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[2]  M. Schatz,et al.  Algorithms Gage: a Critical Evaluation of Genome Assemblies and Assembly Material Supplemental , 2008 .

[3]  E. Eichler,et al.  Limitations of next-generation genome sequence assembly , 2011, Nature Methods.

[4]  Siu-Ming Yiu,et al.  IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth , 2012, Bioinform..

[5]  E. Eichler,et al.  Shotgun sequence assembly and recent segmental duplications within the human genome , 2004, Nature.

[6]  Adel Dayarian,et al.  SOPRA: Scaffolding algorithm for paired reads via statistical optimization , 2010, BMC Bioinformatics.

[7]  Steven L. Salzberg,et al.  Faculty Opinions recommendation of SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. , 2013 .

[8]  Dawei Li,et al.  The diploid genome sequence of an Asian individual , 2008, Nature.

[9]  Huanming Yang,et al.  De novo assembly of human genomes with massively parallel short read sequencing. , 2010, Genome research.

[10]  Nuno A. Fonseca,et al.  Assemblathon 1: a competitive assessment of de novo short read assembly methods. , 2011, Genome research.

[11]  A. Gnirke,et al.  High-quality draft assemblies of mammalian genomes from massively parallel sequence data , 2010, Proceedings of the National Academy of Sciences.

[12]  Mihai Pop,et al.  Exploiting sparseness in de novo genome assembly , 2012, BMC Bioinformatics.