Combinatorial algorithms for DNA sequence assembly

The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, we present a four-phase approach based on rigorous design criteria that has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and can list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates as high as 10%.
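To make the problem formulation concrete, the following is a minimal sketch of superstring reconstruction in which each fragment may come from either strand. It is not the four-phase method of this paper: it swaps in a plain greedy merge heuristic, and it assumes error-free fragments, exact suffix-prefix overlaps, and no fragment contained inside another. The function names (`reverse_complement`, `overlap`, `greedy_assemble`) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a greedy superstring heuristic with reverse
# complements. Assumes error-free fragments and exact overlaps; this is
# NOT the four-phase algorithm described in the abstract.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def overlap(a, b):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of contigs with the largest exact overlap,
    trying both orientations of each contig, until one contig remains.
    The result may be the source sequence or its reverse complement."""
    contigs = list(fragments)
    while len(contigs) > 1:
        best = (-1, None, None, None)  # (overlap length, i, j, merged string)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                for a_or in (a, reverse_complement(a)):
                    for b_or in (b, reverse_complement(b)):
                        k = overlap(a_or, b_or)
                        if k > best[0]:
                            best = (k, i, j, a_or + b_or[k:])
        _, i, j, merged = best
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs[0]


if __name__ == "__main__":
    # Two fragments of a short made-up sequence; the second is given on the
    # opposite strand, as can happen with shotgun data.
    fragments = ["ATGCGTA", reverse_complement("GTACCTGA")]
    print(greedy_assemble(fragments))
```

Because the strand of the result is arbitrary, a caller comparing against a known sequence should accept either the sequence or its reverse complement. Handling sequencing errors would require replacing the exact `overlap` test with an approximate alignment, which this sketch deliberately leaves out.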
