Linear approximation of shortest superstrings

We consider the following problem: given a collection of strings s_1, …, s_m, find the shortest string s such that each s_i appears as a substring (a consecutive block) of s. Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice: repeatedly merge the pair of (distinct) strings with maximum overlap until only one string remains. Let n denote the length of the optimal superstring. A common conjecture states that this greedy procedure produces a superstring of length O(n) (in fact, 2n), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent O(n log n) result. We show that the greedy algorithm does in fact achieve a constant-factor approximation, proving an upper bound of 4n. Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most 3n. We also show the superstring problem to be MAXSNP-hard, which implies that a polynomial-time approximation scheme for this problem is unlikely.
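The greedy procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `overlap` and `greedy_superstring` are hypothetical, the quadratic pair search is chosen for clarity rather than speed, ties between equally good pairs are broken arbitrarily (first pair found), and the sketch assumes it is acceptable to discard duplicates and strings already contained in another input before merging.

```python
from typing import Iterable, List


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: Iterable[str]) -> str:
    """Repeatedly merge the pair of distinct strings with maximum overlap."""
    # Assumption of this sketch: drop duplicates and any string that is
    # already a substring of another input, since it is covered for free.
    items: List[str] = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]

    while len(items) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Examine every ordered pair (a, b) and keep the largest overlap.
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge: append the part of b that is not already covered by a.
        merged = items[best_i] + items[best_j][best_k:]
        items = [s for idx, s in enumerate(items) if idx not in (best_i, best_j)]
        items.append(merged)

    return items[0] if items else ""


if __name__ == "__main__":
    print(greedy_superstring(["abc", "bcd", "cde"]))
```

On the input `["abc", "bcd", "cde"]` the sketch first merges `abc` and `bcd` (overlap 2) into `abcd`, then merges `abcd` with `cde` (overlap 2), giving a superstring of length 5.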
