Distinguishing string selection problems

This paper presents a collection of string algorithms that are at the core of several biological problems, such as discovering potential drug targets and creating diagnostic probes, universal primers, or unbiased consensus sequences. All these problems reduce to the task of finding a pattern that, with some error, occurs in one set of strings (Closest Substring Problem) and does not occur in another set (Farthest String Problem). In this paper, we break the problem down into several subproblems and prove the following results.
1. The following are all NP-hard: the Farthest String Problem, the Closest Substring Problem, and the Closest String Problem of finding a string that is close to each string in a set.
2. There is a PTAS for the Farthest String Problem based on a linear programming relaxation technique.
3. There is a polynomial-time (4/3 + ε)-approximation algorithm for the Closest String Problem for any small constant ε > 0. Using this algorithm, we also provide an efficient heuristic algorithm for the Closest Substring Problem.
4. The problem of finding a string that is at least Hamming distance d from as many strings in a set as possible cannot be approximated within n^ε in polynomial time for some fixed constant ε unless NP = P, where n is the number of strings in the set.
5. There is a polynomial-time 2-approximation for finding a string that is both the Closest Substring to one set and the Farthest String from another set.
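To make the two core objectives concrete, the following is a minimal brute-force sketch, not one of the algorithms from the paper. It assumes all input strings have equal length and uses Hamming distance: the Closest String objective minimizes the largest distance to any input string, while the Farthest String objective maximizes the smallest distance. The function names `hamming`, `closest_string`, and `farthest_string` are hypothetical, and the exhaustive search is exponential in the string length, so it is only usable on toy inputs.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def closest_string(strings, alphabet):
    """Exhaustively find a string minimizing the maximum Hamming distance
    to every string in `strings`. Returns (string, radius)."""
    length = len(strings[0])
    best = None
    for candidate in product(alphabet, repeat=length):
        s = "".join(candidate)
        radius = max(hamming(s, t) for t in strings)
        if best is None or radius < best[1]:
            best = (s, radius)
    return best


def farthest_string(strings, alphabet):
    """Exhaustively find a string maximizing the minimum Hamming distance
    to every string in `strings`. Returns (string, distance)."""
    length = len(strings[0])
    best = None
    for candidate in product(alphabet, repeat=length):
        s = "".join(candidate)
        distance = min(hamming(s, t) for t in strings)
        if best is None or distance > best[1]:
            best = (s, distance)
    return best


if __name__ == "__main__":
    sample = ["AAC", "AAG", "ATC"]  # toy DNA-like input, illustrative only
    print(closest_string(sample, "ACGT"))
    print(farthest_string(sample, "ACGT"))
```

Because both problems are NP-hard, this search only serves to define the objectives; the approximation results listed above are what make larger instances tractable.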
