Optimal Solutions for the Closest-String Problem via Integer Programming

In this paper we study the closest-string problem (CSP), defined as follows: given a finite set S = {s1, s2, ..., sn} of strings, each of length m, find a center string t of length m that minimizes d such that dH(t, si) ≤ d for every string si ∈ S, where dH(t, si) denotes the Hamming distance between t and si. The problem is NP-hard and has applications in molecular biology and coding theory. Although good approximation algorithms exist for it, as do exact algorithms for instances with constant d, no studies have attempted to solve the general case exactly. We propose three integer-programming (IP) formulations and a heuristic that provides upper bounds on the value of an optimal solution. We report computational results, on randomly generated instances, for the heuristic and for a branch-and-bound algorithm based on one of the IP formulations. These results show that CSP instances of moderate size can be solved to optimality.
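
To make the problem definition concrete, the following minimal Python sketch computes the Hamming distance and finds an optimal center string by exhaustive enumeration. It is an illustration of the definition only, not one of the IP formulations or the heuristic proposed in the paper. The function names `hamming` and `closest_string_bruteforce` are hypothetical, and the search takes time exponential in m, so it is usable only on very small instances.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_string_bruteforce(strings):
    """Return (t, d): a center string t and the minimum radius d.

    Illustrative exhaustive search (an assumption of this sketch, not the
    paper's method). Candidates are drawn from the characters that occur
    in the input; a character that never occurs cannot reduce any distance,
    so this restriction does not lose optimality.
    """
    m = len(strings[0])
    alphabet = sorted(set("".join(strings)))
    best_t, best_d = None, m + 1
    for cand in product(alphabet, repeat=m):
        t = "".join(cand)
        # Objective of the CSP: the largest distance from t to any input string.
        d = max(hamming(t, s) for s in strings)
        if d < best_d:
            best_t, best_d = t, d
    return best_t, best_d


if __name__ == "__main__":
    t, d = closest_string_bruteforce(["ACGT", "ACGA", "TCGT"])
    print(t, d)
```

In this example the three input strings are pairwise distinct, so no center can achieve d = 0, while "ACGT" is within distance 1 of each of them; the optimal value is therefore d = 1.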
