Exact and approximation algorithms for DNA sequence reconstruction

The DNA sequence in every human being is a text of three billion characters from a four letter alphabet; determining this sequence is a major project in molecular biology. The fundamental task biologists face is to reconstruct a long sequence given short fragments from unknown locations. These fragments contain errors, and may represent the sequence on one strand of the double-helix, or the reverse complement sequence on the other strand. The Sequence Reconstruction Problem is, given a collection ${\cal F}$ of fragment sequences and an error rate $0\le\epsilon < 1$, find a shortest sequence ${\cal S}$ such that every fragment $F\in{\cal F}$, or its reverse complement, matches a substring of ${\cal S}$ with at most $\epsilon\vert F\vert$ errors. Sequence Reconstruction is ${\cal NP}$-complete. We decompose the problem into (1) constructing a graph of approximate overlaps between pairs of fragments, (2) selecting a set of overlaps of maximum total weight that induce a consistent layout of the fragments, (3) merging the overlaps into a multiple sequence alignment and voting on a consensus. A solution to (1) through (3) yields a reconstructed sequence feasible at error rate ${2\epsilon\over 1-\epsilon}$ and at most a factor ${1\over 1-\epsilon}$ longer than the shortest reconstruction, given some assumptions on fragment error. We define a measure of the overlap in a reconstruction, show that maximizing the overlap minimizes the length, and that approximating (2) within a factor of $\alpha$ approximates Sequence Reconstruction within a factor of $(1-\epsilon)\alpha$ under the overlap measure. We construct the overlap graph for (1) in $O(\epsilon N\sp2$) time given fragments of total length N at error rate $\epsilon$. We develop two exact and two approximation algorithms for (2). Our best exact algorithm computes an optimal layout for a graph of E overlaps and V fragments in $O(K(E + V$ log $V))$ time, where $K\le 2\sp{E}$ is the size of the branch-and-bound search tree. Our best approximation algorithm computes a layout with overlap at least ${1\over2}$ the maximum in $O(V(E + V$ log $V$)log $V)$ time. This is the first treatment of Sequence Reconstruction with inexact data and unknown complementarity.