Lagrangian Relaxation - Solving NP-hard Problems in Computational Biology via Combinatorial Optimization

This thesis is devoted to two $\mathcal{NP}$-complete combinatorial optimization problems arising in computational biology: the well-studied \emph{multiple sequence alignment} problem and the newly formulated \emph{interval constraint coloring} problem. It shows that advanced mathematical programming techniques are capable of solving large-scale real-world instances from biology to optimality. Furthermore, it presents alternative methods that provide approximate solutions.

In the first part of the thesis, we present a \emph{Lagrangian relaxation} approach for the multiple sequence alignment (MSA) problem. A multiple alignment is a common mathematical abstraction of the comparison of multiple biological sequences, such as DNA, RNA, or protein sequences. If the weight of a multiple alignment is measured by the sum of the projected pairwise weights of all pairs of sequences in the alignment, then finding a multiple alignment of maximum weight is $\mathcal{NP}$-complete if the number of sequences is not fixed. The majority of the available tools for aligning multiple sequences implement heuristic algorithms; no current exact method is able to solve moderately large instances or instances involving sequences that exhibit a low degree of similarity. We present a branch-and-bound (B\&B) algorithm for the MSA problem. We approximate the optimal integer solution in the nodes of the B\&B tree by a Lagrangian relaxation of an integer linear programming (ILP) formulation for MSA relative to an exponentially large class of inequalities that ensure that all pairwise alignments can be incorporated into a multiple alignment. By lifting these constraints prior to dualization, the Lagrangian subproblem becomes an \emph{extended pairwise alignment} (EPA) problem: compute the longest path in an acyclic graph, where the path is charged a penalty for entering ``obstacles''.
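The flavor of such a penalized longest-path subproblem can be illustrated with a small sketch. It is a simplified toy model, not the EPA algorithm of the thesis: we assume that each obstacle is a single node, that the nodes are numbered in topological order, and that entering node `v` costs a fixed amount `penalty[v]`, which plays the role a Lagrangian multiplier would play. The function name and all parameter names are hypothetical.

```python
from typing import Dict, List, Tuple


def penalized_longest_path(
    num_nodes: int,
    edges: List[Tuple[int, int, float]],
    penalty: Dict[int, float],
    source: int,
    target: int,
) -> Tuple[float, List[int]]:
    """Longest source-target path in a DAG with node-entry penalties.

    Assumption: nodes 0..num_nodes-1 are numbered in topological order,
    so every edge (u, v, weight) satisfies u < v. Entering node v
    subtracts penalty.get(v, 0.0) from the path weight.
    """
    neg_inf = float("-inf")
    incoming: List[List[Tuple[int, float]]] = [[] for _ in range(num_nodes)]
    for u, v, w in edges:
        if u >= v:
            raise ValueError("edges must respect the topological numbering")
        incoming[v].append((u, w))

    best = [neg_inf] * num_nodes  # best[v]: weight of the best path source -> v
    pred = [-1] * num_nodes       # predecessor of v on that path
    best[source] = 0.0

    # Dynamic program over the nodes in topological order.
    for v in range(num_nodes):
        for u, w in incoming[v]:
            if best[u] == neg_inf:
                continue  # u is not reachable from the source
            candidate = best[u] + w - penalty.get(v, 0.0)
            if candidate > best[v]:
                best[v] = candidate
                pred[v] = u

    if best[target] == neg_inf:
        raise ValueError("target is not reachable from source")

    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(pred[path[-1]])
    path.reverse()
    return best[target], path
```

Raising the penalty on a node can change which path is optimal, which is the mechanism a multiplier update exploits: with edges `(0,1,5)`, `(0,2,3)`, `(1,3,2)`, `(2,3,3)`, the path through node 1 is best without penalties, while a penalty of 4 on node 1 makes the path through node 2 preferable.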
We describe an efficient algorithm that solves the EPA problem repeatedly to determine near-optimal \emph{Lagrangian multipliers} via subgradient optimization. Reformulating the dualized constraints in terms of additionally introduced variables improves the convergence rate dramatically. We account for the exponential number of dualized constraints by starting with an empty \emph{constraint pool}, to which we add in each iteration the cuts that are most violated by a convex combination of a small number of preceding Lagrangian solutions (including the current solution). In this \emph{relax-and-cut} scheme, only inequalities from the constraint pool are dualized.

The interval constraint coloring problem appears in the interpretation of experimental data in biochemistry. Monitoring hydrogen-deuterium exchange rates via mass spectrometry is a method used to obtain information about protein tertiary structure. The output of these experiments provides aggregate data about the exchange rate of residues in overlapping fragments of the protein backbone. These fragments must be re-assembled in order to obtain a global picture of the protein structure. The interval constraint coloring problem is the mathematical abstraction of this re-assembly process. The objective of the interval constraint coloring problem is to assign a color (exchange rate) to each element of a set of integers (protein residues) such that a set of constraints is satisfied. Each constraint is made up of a closed interval (protein fragment) and requirements on the number of elements in the interval that belong to each color class (exchange rates observed in the experiments). We introduce a polyhedral description of the interval constraint coloring problem, which serves as a basis for attacking the problem with integer linear programming (ILP) methods and tools that perform well in practice.
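One natural way to write these requirements as an integer program is sketched below. It is given for illustration only and is not necessarily the formulation used in the thesis; all notation is our own assumption. Let $\{1,\dots,n\}$ be the residues, $C$ the set of colors, $\mathcal{I}$ the set of intervals, and $r(I,c)$ the required number of elements of color $c$ in interval $I$. With a binary variable $x_{i,c}$ that equals $1$ if and only if residue $i$ receives color $c$, a feasible coloring satisfies
\[
\begin{array}{rcll}
\displaystyle\sum_{c \in C} x_{i,c} & = & 1 & \mbox{for all } i \in \{1,\dots,n\},\\[2ex]
\displaystyle\sum_{i \in I} x_{i,c} & = & r(I,c) & \mbox{for all } I \in \mathcal{I},\ c \in C,\\[2ex]
x_{i,c} & \in & \{0,1\} & \mbox{for all } i \in \{1,\dots,n\},\ c \in C.
\end{array}
\]
Replacing the last condition by $0 \le x_{i,c} \le 1$ yields a linear programming relaxation of this sketch.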
Since the goal is to provide biochemists with all possible candidate solutions, we combine related solutions into equivalence classes in an improved ILP formulation in order to reduce the running time of our enumeration algorithm. Moreover, we establish the polynomial-time solvability of the two-color case via the integrality of the linear programming relaxation polytope $\mathcal{P}$, and we also present a combinatorial polynomial-time algorithm for this case. We apply this algorithm as a subroutine to approximate solutions to instances with an arbitrary but fixed number of colors and achieve an order-of-magnitude improvement in running time over the (exact) ILP approach. We show that the problem is $\mathcal{NP}$-complete for an arbitrary number of colors, and we provide algorithms that, given an instance with $\mathcal{P}\neq\emptyset$, find a coloring that satisfies all the coloring requirements within $\pm 1$ of the prescribed value. In light of our $\mathcal{NP}$-completeness result, this is essentially the best one can hope for. Our approach is based on polyhedral theory and randomized rounding techniques.

In practice, data emanating from the experiments are noisy, which normally causes the instance to be infeasible and, in some cases, even forces $\mathcal{P}$ to be empty. To deal with this problem, the ILP minimizes the total sum of absolute deviations from the coloring requirements over all intervals. The combinatorial approach for the two-color case optimizes the same objective function. Furthermore, we use this combinatorial method to compute, in a Lagrangian fashion, a bound on the minimum total error, which is exploited in a branch-and-bound manner to determine all optimal colorings. Alternatively, we study a variant of the problem in which we want to maximize the number of requirements that are satisfied.
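The error objective and the task of listing all optimal colorings can be made concrete on tiny instances with a brute-force sketch. This is not the ILP, the combinatorial two-color algorithm, or the branch-and-bound method of the thesis; it enumerates all colorings and is exponential in the number of residues. We assume that the deviation is summed over all intervals and all colors, and the names `total_deviation` and `all_optimal_colorings` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# A requirement is (first, last, {color: required count}) for the closed
# interval [first, last] of 1-based positions. Colors are 0..num_colors-1.
Requirement = Tuple[int, int, Dict[int, int]]


def total_deviation(
    coloring: Sequence[int],
    requirements: List[Requirement],
    num_colors: int,
) -> int:
    """Sum over intervals and colors of |observed count - required count|."""
    total = 0
    for first, last, required in requirements:
        segment = list(coloring[first - 1:last])
        for color in range(num_colors):
            total += abs(segment.count(color) - required.get(color, 0))
    return total


def all_optimal_colorings(
    n: int,
    num_colors: int,
    requirements: List[Requirement],
) -> Tuple[int, List[Tuple[int, ...]]]:
    """Return the minimum total deviation and every coloring attaining it."""
    best_error = -1  # -1 marks "no coloring examined yet"
    best: List[Tuple[int, ...]] = []
    for coloring in product(range(num_colors), repeat=n):
        error = total_deviation(coloring, requirements, num_colors)
        if best_error < 0 or error < best_error:
            best_error, best = error, [coloring]
        elif error == best_error:
            best.append(coloring)
    return best_error, best
```

A minimum error of zero means the instance is feasible. On a noisy instance with contradictory requirements, the minimum is positive and several colorings may attain it, which is why enumerating all optimal colorings is of interest.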
We prove that this variant is $\mathcal{APX}$-hard even in the two-color case and thus does not admit a polynomial-time approximation scheme (PTAS) unless $\mathcal{P}=\mathcal{NP}$. Therefore, we slightly relax the condition under which a requirement counts as satisfied (by a factor of $(1+\epsilon)$) and propose a \emph{quasi-polynomial time approximation scheme} (QPTAS) that finds a coloring that ``satisfies'' the requirements of as many intervals as possible.
