Scalable genome scaffolding using integer linear programming

The rapidly diminishing cost of genome sequencing is driving renewed interest in large-scale genome sequencing programs such as Genome 10K (G10K). Despite this interest, assembling large genomes from short reads remains an extremely resource-intensive process. This work presents a scalable algorithm for creating scaffolds, that is, ordered and oriented sets of assembled contigs, which is one step of a practical assembly. The scaffolds are computed using integer linear programming (ILP). To process large mammalian genomes, we employ non-serial dynamic programming (NSDP) and a hierarchical strategy. Both existing and novel quantitative metrics are used to compare scaffolding tools and to gain deeper insight into the challenges of scaffolding. The code is available at: https://bitbucket.org/jrl03001/silp
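
To make the idea of scaffolding as a 0-1 optimization concrete, the following is a minimal sketch and not the formulation used in this work. It assumes a simplified model in which each contig has one binary orientation variable and each paired-read link either agrees or disagrees with a given pair of orientations. The function name `orient_contigs`, the link tuple layout, and the weights are hypothetical. A real ILP solver is replaced here by exhaustive enumeration, which is only feasible for a handful of contigs.

```python
from itertools import product


def orient_contigs(n_contigs, links):
    """Exhaustively solve a toy 0-1 contig orientation problem.

    links: list of (i, j, same_strand, weight) tuples (hypothetical layout).
        same_strand=True means the link is consistent when contigs i and j
        have the same orientation, False when they are opposite.
    Returns (best_score, orientations), where orientations[k] is
    0 (forward) or 1 (reverse) for contig k.
    """
    best_score, best_assign = -1, None
    # Fix contig 0 as forward: flipping every contig gives an equivalent solution.
    for rest in product((0, 1), repeat=n_contigs - 1):
        assign = (0,) + rest
        # Objective: total weight of links consistent with this assignment.
        score = sum(
            w for i, j, same, w in links
            if (assign[i] == assign[j]) == same
        )
        if score > best_score:
            best_score, best_assign = score, assign
    return best_score, best_assign


# Three contigs with one conflicting link (illustrative data only).
links = [
    (0, 1, True, 5),   # contigs 0 and 1 on the same strand
    (1, 2, False, 3),  # contigs 1 and 2 on opposite strands
    (0, 2, True, 1),   # conflicts with the two links above
]
score, orientation = orient_contigs(3, links)
print(score, orientation)
```

Enumeration grows as 2^n in the number of contigs, which is why large instances need a solver together with a decomposition such as the NSDP and hierarchical strategy named above.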
