SOPRA: Scaffolding algorithm for paired reads via statistical optimization

Background
High-throughput sequencing (HTS) platforms produce gigabases of short-read (<100 bp) data per run. While these short reads are adequate for resequencing applications, de novo assembly of moderate-size genomes from such reads remains a significant challenge. These limitations can be partially overcome by using mate pair technology, which provides pairs of short reads separated by a known distance along the genome.

Results
We have developed SOPRA, a tool designed to exploit mate pair/paired-end information for the assembly of short reads. The main focus of the algorithm is selecting a sufficiently large subset of simultaneously satisfiable mate pair constraints to achieve a balance between the size and the quality of the output scaffolds. Scaffold assembly is presented as an optimization problem for variables associated with the vertices and edges of the contig connectivity graph. The vertices of this graph are individual contigs, with edges drawn between contigs connected by mate pairs. Similar graph problems have been invoked in the context of shotgun sequencing and scaffold building for previous generations of sequencing projects. However, given the error-prone nature of HTS data and the fundamental limitations imposed by the shortness of the reads, the ad hoc greedy algorithms used in the earlier studies are likely to lead to poor-quality results in the current context. SOPRA circumvents this problem by treating all the constraints on an equal footing when solving the optimization problem, with the solution itself indicating the problematic constraints (chimeric/repetitive contigs, etc.) to be removed. The process of solving and removing constraints is iterated until a core set of consistent constraints is reached. For SOLiD sequencer data, SOPRA uses a dynamic programming approach to robustly translate the color-space assembly to base-space. To assess the quality of an assembly, we report the no-match/mismatch error rate as well as the rates of various rearrangement errors.

Conclusions
Applying SOPRA to real data from bacterial genomes, we were able to assemble contigs into scaffolds of significant length (N50 up to 200 Kb) with very few errors introduced in the process. In general, the methodology presented here will allow better scaffold assemblies of any type of mate pair sequencing data.
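The idea of solving for all constraints at once and letting the solution expose the inconsistent ones can be illustrated with a toy sketch. This is not SOPRA's implementation. It assumes, purely for illustration, that each contig has an orientation of +1 or -1, that each mate pair link states whether two contigs share an orientation (+1) or not (-1), and that each link carries a weight equal to the number of supporting mate pairs. It also replaces any heuristic optimizer with an exhaustive search, which is only feasible for a handful of contigs. The names `best_orientations` and `consistent_core` and the example `links` are hypothetical.

```python
from itertools import product


def best_orientations(n_contigs, links):
    """Find the orientation assignment with the least violated link weight.

    links: list of (i, j, sign, weight) tuples (assumed format), where
    sign=+1 means contigs i and j should share an orientation and
    sign=-1 means they should be opposite.
    """
    best_s, best_violated, best_cost = None, None, None
    # Fix contig 0 to +1: flipping every contig gives an equivalent solution.
    for rest in product((1, -1), repeat=n_contigs - 1):
        s = (1,) + rest
        violated = [l for l in links if s[l[0]] * s[l[1]] != l[2]]
        cost = sum(l[3] for l in violated)
        if best_cost is None or cost < best_cost:
            best_s, best_violated, best_cost = s, violated, cost
    return best_s, best_violated


def consistent_core(n_contigs, links):
    """Solve with all links treated equally, then drop the violated ones.

    In this toy version one pass is enough, because the returned
    orientation satisfies every link that is kept.
    """
    s, violated = best_orientations(n_contigs, links)
    kept = [l for l in links if l not in violated]
    return s, kept, violated


# Hypothetical example: contigs 0, 1, 2 form an inconsistent triangle,
# and the weakly supported link (0, 2) is the cheapest one to give up.
links = [(0, 1, +1, 5), (1, 2, +1, 4), (0, 2, -1, 1), (2, 3, -1, 3)]
orient, kept, dropped = consistent_core(4, links)
print(orient, dropped)
```

No link is discarded before the optimization runs; the weakly supported link is identified as problematic only because the best global assignment cannot satisfy it together with the stronger links.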
