Single molecule sequencing-guided scaffolding and correction of draft assemblies

Although single molecule sequencing is still improving, the length of the sequences it generates is undeniably an advantage in genome assembly. Prior work that uses long reads for genome assembly has mostly focused on correcting sequencing errors and improving the contiguity of de novo assemblies. In this paper, we propose a disassembling-reassembling approach that both corrects structural errors in a draft assembly and scaffolds a target assembly based on error-corrected single molecule sequences. In particular, the disassembling stage identifies potential structural errors in the draft assembly and breaks the assembly at those positions, ensuring that the resulting contigs (called validated segments in this paper) are structurally consistent with the new sequence data. The reassembling stage then uses the long reads that remain unaligned after the disassembling stage to bridge the validated segments into larger contigs, aiming to improve the contiguity of the final assembly. To achieve this goal, we formulate a maximum alternating path cover problem. We prove that this problem is NP-hard and solve it with a 2-approximation algorithm. Our experimental results show that our approach can improve the structural correctness of target assemblies at the cost of some contiguity, even with smaller amounts of long reads. Finally, using a well-established assembly benchmark, we show that our reassembling process can also serve as a competitive scaffolder.
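
To make the idea of a validated segment concrete, the following Python sketch illustrates one simple way a disassembling step could be modelled. It is an assumption for illustration, not the procedure used in this paper: long-read alignments to a single draft contig are given as half-open intervals, and two alignments are taken to support the same segment only if they overlap by at least `min_overlap` bases. The function name `validated_segments` and the parameter `min_overlap` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one draft contig


def validated_segments(alignments: List[Interval], min_overlap: int = 1) -> List[Interval]:
    """Merge read alignments into maximal read-supported segments.

    Illustrative assumption: a junction in the draft contig counts as
    supported only if consecutive alignments overlap by at least
    `min_overlap` bases. Anywhere that fails, the contig is broken.
    """
    segments: List[Interval] = []
    for start, end in sorted(alignments):
        if segments:
            prev_start, prev_end = segments[-1]
            overlap = min(end, prev_end) - start
            # Merge if the read is contained in the current segment
            # or overlaps it sufficiently.
            if end <= prev_end or overlap >= min_overlap:
                segments[-1] = (prev_start, max(prev_end, end))
                continue
        # Otherwise start a new segment: the draft is broken here.
        segments.append((start, end))
    return segments


# Hypothetical alignments on one contig; the region 900..1000 has no support.
reads = [(0, 500), (400, 900), (1000, 1500), (1450, 2000)]
segments = validated_segments(reads, min_overlap=50)
```

Under these assumptions the contig is split into two segments around the unsupported region. A real implementation would also have to weigh alignment quality, read depth, and conflicting alignments, none of which this sketch attempts.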