Separating repeats in DNA sequence assembly

One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression, by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting, without prior knowledge of the consensus sequences of the repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the so-called k-median problem from combinatorial optimization, which we solve using integer linear programming. Experiments with an implementation show that we can separate fragments that are overlaid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge, this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.
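To make the k-median subproblem concrete, the following is a minimal sketch, not the method of the paper. It assumes that fragments are already aligned strings of equal length, that dissimilarity is plain Hamming distance, and that the number of copies k is given; the paper's formulation does not assume k is known, and it solves the problem with integer linear programming rather than the brute-force enumeration used here. The names `hamming` and `k_median` and the toy fragments are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of aligned positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_median(fragments, k):
    """Choose k fragments as medians so that the total distance from every
    fragment to its nearest median is minimized.

    Brute force over all k-subsets (an assumption made for illustration;
    it stands in for an integer linear programming solver and is only
    practical for small inputs).
    """
    n = len(fragments)
    dist = [[hamming(fragments[i], fragments[j]) for j in range(n)]
            for i in range(n)]

    best_cost, best_medians = None, None
    for medians in combinations(range(n), k):
        cost = sum(min(dist[i][m] for m in medians) for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_medians = cost, medians

    # Assign each fragment to its nearest chosen median (one cluster per copy).
    assignment = [min(best_medians, key=lambda m: dist[i][m]) for i in range(n)]
    return best_cost, best_medians, assignment


# Toy data: six overlaid fragments that sample two hypothetical repeat copies.
fragments = ["ACGTAC", "ACGTAC", "ACGTAA", "TCGTTC", "TCGTTC", "TCGATC"]
cost, medians, assignment = k_median(fragments, 2)
print(cost, medians, assignment)
```

On this toy input the fragments split into the first three and the last three, each group assigned to one median. The objective being minimized is the same one an integer program for k-median would encode; only the search strategy differs.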
