Single molecule sequencing-guided scaffolding and correction of draft assemblies

Background: Although single molecule sequencing is still improving, the length of the reads it generates is a clear advantage in genome assembly. Prior work that uses long reads for genome assembly has mostly focused on correcting sequencing errors and improving the contiguity of de novo assemblies.

Results: We propose a disassembling-reassembling approach for both correcting structural errors in the draft assembly and scaffolding a target assembly based on error-corrected single molecule sequences. To achieve this goal, we formulate a maximum alternating path cover problem. We prove that this problem is NP-hard, and solve it with a 2-approximation algorithm.

Conclusions: Our experimental results show that our approach can improve the structural correctness of target assemblies at the cost of some contiguity, even with smaller amounts of long reads. In addition, our reassembling process can also serve as a competitive scaffolder relative to well-established assembly benchmarks.
