ILP-based maximum likelihood genome scaffolding

Background: Interest in de novo genome assembly has been renewed in the past decade due to rapid advances in high-throughput sequencing (HTS) technologies, which generate relatively short reads and therefore yield highly fragmented assemblies consisting of contigs. Additional long-range linkage information is typically used to orient, order, and link contigs into larger structures referred to as scaffolds. Due to library preparation artifacts and erroneous mapping of reads originating from repeats, scaffolding remains a challenging problem. In this paper, we provide a scalable scaffolding algorithm (SILP2) that employs a maximum likelihood model capturing read mapping uncertainty and/or non-uniformity of contig coverage; the model is solved using integer linear programming (ILP). A Non-Serial Dynamic Programming (NSDP) paradigm is applied to make the algorithm practical for larger mammalian genomes. To compare scaffolding tools, we employ novel quantitative metrics in addition to the existing metrics in the field. We have also expanded the set of experiments to include scaffolding of low-complexity metagenomic samples.

Results: SILP2 achieves better scalability than the previous release of SILP through a more efficient NSDP algorithm. The results show that SILP2 compares favorably to the previous methods OPERA and MIP in both scalability and accuracy for scaffolding single genomes of up to human size, and significantly outperforms them on scaffolding low-complexity metagenomic samples.

Conclusions: Equipped with NSDP, SILP2 is able to scaffold large mammalian genomes, resulting in the longest and most accurate scaffolds. The ILP formulation for the maximum likelihood model is shown to be flexible enough to handle metagenomic samples.
