On the closest string and substring problems

The problem of finding a center string that is "close" to every given string arises in computational molecular biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring problem. Given a set of strings <i>S</i> = {<i>s</i><sub>1</sub>, <i>s</i><sub>2</sub>, ..., <i>s</i><sub>n</sub>}, each of length <i>m</i>, the Closest String problem is to find the smallest <i>d</i> and a string <i>s</i> of length <i>m</i> that is within Hamming distance <i>d</i> of each <i>s</i><sub>i</sub> ∈ <i>S</i>. This problem comes from coding theory, where we look for a code not too far away from a given set of codes. The Closest Substring problem, with an additional input integer <i>L</i>, asks for the smallest <i>d</i> and a string <i>s</i> of length <i>L</i> that is within Hamming distance <i>d</i> of some length-<i>L</i> substring of each <i>s</i><sub>i</sub>. This problem is much more elusive than the Closest String problem. The Closest Substring problem is formulated from applications in finding conserved regions, identifying genetic drug targets, and generating genetic probes in molecular biology. The existence of efficient approximation algorithms for each of these problems is a major open question in this area. We present two polynomial-time approximation algorithms with approximation ratio 1 + ε for any small ε to settle both questions.
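To make the two problem statements concrete, the following is a minimal Python sketch of the objective each problem minimizes, together with an exhaustive search over all candidate centers. It is not the approximation scheme announced in the abstract: the search takes time exponential in the center length and is only practical for tiny inputs. All function names are hypothetical and chosen for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def string_radius(center, strings):
    """Closest String objective: largest distance from center to any s_i."""
    return max(hamming(center, s) for s in strings)


def substring_radius(center, strings):
    """Closest Substring objective: for each s_i take its best window of
    length L = len(center), then take the worst case over all s_i."""
    L = len(center)
    return max(
        min(hamming(center, s[i:i + L]) for i in range(len(s) - L + 1))
        for s in strings
    )


def closest_string_bruteforce(strings, alphabet):
    """Try every string of length m over the alphabet (illustration only)."""
    m = len(strings[0])
    best = None
    for cand in product(alphabet, repeat=m):
        center = "".join(cand)
        d = string_radius(center, strings)
        if best is None or d < best[0]:
            best = (d, center)
    return best


def closest_substring_bruteforce(strings, L, alphabet):
    """Try every string of length L over the alphabet (illustration only)."""
    best = None
    for cand in product(alphabet, repeat=L):
        center = "".join(cand)
        d = substring_radius(center, strings)
        if best is None or d < best[0]:
            best = (d, center)
    return best
```

For example, `closest_string_bruteforce(["ACGT", "ACGA", "TCGT"], "ACGT")` returns `(1, "ACGT")`, since no single string matches all three inputs exactly. The only difference between the two objectives is the inner minimum over windows in `substring_radius`: the center must be matched against an unknown position in each input, which is what makes the Closest Substring problem harder to approximate.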
