FinIS: Improved in silico Finishing Using an Exact Quadratic Programming Formulation

As sequencing becomes increasingly democratized, the reliance of sequence assembly programs on heuristics is at odds with the need for black-box assembly solutions that non-specialists can use reliably. In this work, we present a formal definition of in silico assembly validation and finishing, and explore the feasibility of solving this problem exactly using quadratic programming (FinIS). Based on results for several real and simulated datasets, we demonstrate that FinIS validates the correctness of a larger fraction of the assembly than existing ad hoc tools. Using a test for unique optimal solutions, we show that FinIS can improve both the precision and the recall of assembled-sequence correctness compared to competing programs. Source code and executables for FinIS are freely available at http://sourceforge.net/projects/finis/.
