Protein Structure Comparison: From Contact Map Overlap Maximisation to Distance-based Alignment Search Tool. (La comparaison structurale des protéines : de la maximisation du recouvrement de cartes de contacts à l'alignement basé sur les distances)

In structural biology, it is widely accepted that the three-dimensional structure of a protein determines its function. A fruitful assumption based on this paradigm is that proteins sharing similar three-dimensional structures may derive from a common ancestor and thus may share similar functions. Computing the similarity between two protein structures is therefore a crucial task and has been extensively investigated. Among the proposed methods, we focus on the similarity measure called Contact Map Overlap maximisation (CMO), mainly because it provides scores that can be used to obtain good automatic classifications of protein structures. In this thesis, comparing two protein structures is modelled as finding specific sub-graphs in specific $k$-partite graphs called alignment graphs, and we show that this task can be done efficiently using advanced combinatorial optimisation techniques. In the first part of the thesis, we model CMO as a kind of maximum edge induced sub-graph problem in alignment graphs, for which we design an exact solver that outperforms the other CMO algorithms from the literature. Even though we succeeded in accelerating CMO, the procedure remains too time-consuming for large database comparisons. The second part of the thesis is dedicated to further accelerating CMO by using structural biology knowledge: we propose a hierarchical approach for CMO based on the secondary structure of the proteins. Finally, although CMO is a very good scoring scheme, the alignments it provides frequently have large root mean square deviation (RMSD) values. To overcome this weakness, in the last part of the thesis we propose a new comparison method based on internal distances, which we call DAST (Distance-based Alignment Search Tool). It is modelled as a maximum clique problem in alignment graphs, for which we design a dedicated solver with very good performance.
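To make the CMO objective concrete, the following minimal Python sketch computes contact maps and counts the contacts shared under an order-preserving alignment. It is only an illustration under stated assumptions and not the exact solver developed in the thesis: the function names (`contact_map`, `overlap`, `brute_force_cmo`), the distance threshold, and the exhaustive search are all hypothetical choices, and the search is usable only on toy inputs.

```python
from itertools import combinations


def contact_map(coords, threshold):
    """Return the set of residue pairs (i, k), i < k, within `threshold`.

    `coords` is a list of 3D points, one per residue (an assumed input format).
    """
    contacts = set()
    for i, k in combinations(range(len(coords)), 2):
        d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[k]))
        if d2 <= threshold ** 2:
            contacts.add((i, k))
    return contacts


def overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A that the alignment maps onto contacts of B.

    `alignment` is a list of (residue_in_A, residue_in_B) pairs that is
    increasing in both components (order-preserving).
    """
    mapping = dict(alignment)
    return sum(
        1
        for i, k in contacts_a
        if i in mapping and k in mapping
        and (mapping[i], mapping[k]) in contacts_b
    )


def brute_force_cmo(n_a, n_b, contacts_a, contacts_b):
    """Enumerate every order-preserving alignment and keep the best overlap.

    Exponential in the protein lengths: an illustration of the objective,
    not a practical algorithm.
    """
    best_score, best_alignment = 0, []
    for size in range(1, min(n_a, n_b) + 1):
        for rows in combinations(range(n_a), size):
            for cols in combinations(range(n_b), size):
                alignment = list(zip(rows, cols))
                score = overlap(contacts_a, contacts_b, alignment)
                if score > best_score:
                    best_score, best_alignment = score, alignment
    return best_score, best_alignment


if __name__ == "__main__":
    # Two toy "proteins" described directly by their contact sets.
    contacts_a = {(0, 1), (1, 2), (0, 3)}
    contacts_b = {(0, 1), (1, 2), (2, 3)}
    print(brute_force_cmo(4, 4, contacts_a, contacts_b))
```

In the alignment-graph view used in the abstract, each candidate pair (residue of A, residue of B) is a vertex, and an edge links two vertices whenever the corresponding pairs match a contact in A with a contact in B; maximising the overlap then amounts to choosing an order-preserving set of vertices that induces as many edges as possible.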
