Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences

Scaffolding, the problem of ordering and orienting contigs, typically using paired-end reads, is a crucial step in the assembly of high-quality draft genomes. Even as sequencing technologies and mate-pair protocols have improved significantly, scaffolding programs still rely on heuristics, with no guarantees on the quality of the solution. In this work we explore the feasibility of an exact solution for scaffolding and present a first fixed-parameter tractable solution for assembly (Opera). We also describe a graph contraction procedure that allows the solution to scale to large scaffolding problems and demonstrate this by scaffolding several large real and synthetic datasets. In comparisons with existing scaffolders, Opera simultaneously produced longer and more accurate scaffolds, demonstrating the utility of an exact approach. Opera also incorporates an exact quadratic programming formulation to precisely compute gap sizes.
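To make the idea of computing gap sizes by quadratic optimization concrete, the following is a minimal sketch and not Opera's actual formulation. It assumes a scaffold whose contig order is already fixed, and it assumes each paired-end link supplies an estimated distance (mean `mu`, standard deviation `sigma`) between the end of contig `i` and the start of contig `j`. The function names `estimate_gaps` and `solve_linear` are hypothetical. The sketch minimizes a weighted sum of squared deviations, which is an unconstrained special case of a quadratic program, by solving the normal equations directly.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("gap sizes are not uniquely determined by the links")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def estimate_gaps(contig_lengths, links):
    """Estimate the gaps between consecutive contigs in a fixed scaffold order.

    contig_lengths: lengths of the contigs, in scaffold order.
    links: tuples (i, j, mu, sigma) with i < j, where mu is the estimated
           distance from the end of contig i to the start of contig j
           (assumed input format, for illustration only).
    """
    n_gaps = len(contig_lengths) - 1
    # Normal equations of the weighted least-squares objective:
    #   sum over links of ((sum of gaps i..j-1) - target)^2 / sigma^2
    A = [[0.0] * n_gaps for _ in range(n_gaps)]
    b = [0.0] * n_gaps
    for i, j, mu, sigma in links:
        w = 1.0 / (sigma * sigma)
        # Subtract the contigs lying strictly between i and j, so the
        # target is the total of the gaps spanned by this link.
        target = mu - sum(contig_lengths[i + 1:j])
        for r in range(i, j):
            b[r] += w * target
            for c in range(i, j):
                A[r][c] += w
    return solve_linear(A, b)


if __name__ == "__main__":
    lengths = [1000, 500, 800]
    links = [(0, 1, 100.0, 10.0), (1, 2, 200.0, 10.0), (0, 2, 800.0, 10.0)]
    print(estimate_gaps(lengths, links))
```

In this toy input the three links are mutually consistent, so the fitted gaps match the two adjacent-contig estimates. With noisy or conflicting links, the weights `1 / sigma^2` let tighter libraries dominate. A full quadratic program would additionally allow constraints on the gap sizes, which this sketch omits.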
