SCARPA: scaffolding reads with practical algorithms

MOTIVATION Scaffolding is the process of ordering and orienting contigs produced during genome assembly. Accurate scaffolding is essential for finishing draft assemblies, as it facilitates the costly and laborious procedures needed to fill in the gaps between contigs. Conventional formulations of the scaffolding problem are intractable, and most scaffolding programs rely on heuristic or approximate solutions, with potentially exponential running time. RESULTS We present SCARPA, a novel scaffolder, which combines fixed-parameter tractable and bounded algorithms with Linear Programming to produce near-optimal scaffolds. We test SCARPA on real datasets in addition to a simulated diploid genome and compare its performance with several state-of-the-art scaffolders. We show that SCARPA produces longer or similar length scaffolds that are highly accurate compared with other scaffolders. SCARPA is also capable of detecting misassembled contigs and reports them during scaffolding. AVAILABILITY SCARPA is open source and available from http://compbio.cs.toronto.edu/scarpa.

[1]  Wing-Kin Sung,et al.  Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences , 2011, J. Comput. Biol..

[2]  Daniel R. Zerbino,et al.  Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler , 2009, PloS one.

[3]  Xuemin Lin,et al.  A Fast and Effective Heuristic for the Feedback Arc Set Problem , 1993, Inf. Process. Lett..

[4]  Esko Ukkonen,et al.  Fast scaffolding with small independent mixed integer programs , 2011, Bioinform..

[5]  Richard M. Karp,et al.  Reducibility among combinatorial problems" in complexity of computer computations , 1972 .

[6]  S. Salzberg,et al.  Versatile and open software for comparing large genomes , 2004, Genome Biology.

[7]  Walter Pirovano,et al.  BIOINFORMATICS APPLICATIONS , 2022 .

[8]  Eugene W. Myers,et al.  The greedy path-merging algorithm for contig scaffolding , 2002, JACM.

[9]  S. Salzberg,et al.  Hierarchical scaffolding with Bambus. , 2003, Genome research.

[10]  Raymond E. Miller,et al.  Complexity of Computer Computations , 1972 .

[11]  Richard M. Karp,et al.  Reducibility Among Combinatorial Problems , 1972, 50 Years of Integer Programming.

[12]  Adel Dayarian,et al.  SOPRA: Scaffolding algorithm for paired reads via statistical optimization , 2010, BMC Bioinformatics.

[13]  Nuno A. Fonseca,et al.  Assemblathon 1: a competitive assessment of de novo short read assembly methods. , 2011, Genome research.

[14]  Peter Eades A FAST EFFECTIVE HEURISTIC FOR THE FEEDBACK ARC SET PROBLEM , 1993 .

[15]  Nilgun Donmez,et al.  Hapsembler: An Assembler for Highly Polymorphic Genomes , 2011, RECOMB.

[16]  Bruce A. Reed,et al.  Finding odd cycle transversals , 2004, Oper. Res. Lett..

[17]  A. Nijenhuis Combinatorial algorithms , 1975 .

[18]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[19]  Saket Saurabh,et al.  Simpler Parameterized Algorithm for OCT , 2009, IWOCA.

[20]  Huanming Yang,et al.  De novo assembly of human genomes with massively parallel short read sequencing. , 2010, Genome research.

[21]  Steven J. M. Jones,et al.  De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data , 2009, Genome Biology.

[22]  Steven J. M. Jones,et al.  Abyss: a Parallel Assembler for Short Read Sequence Data Material Supplemental Open Access , 2022 .