GR-Align: fast and flexible alignment of protein 3D structures using graphlet degree similarity

MOTIVATION: Protein structure alignment is key for transferring information from well-studied proteins to less studied ones. Because structures are more conserved during evolution than sequences, structural alignment identifies the most precise mapping of equivalent residues. Among the methods for aligning protein structures, maximum Contact Map Overlap (CMO) has received sustained attention during the past decade. Yet known algorithms exhibit modest performance and are not applicable to large-scale comparison.

RESULTS: Graphlets are small induced subgraphs that are used to design sensitive topological similarity measures between nodes and networks. By generalizing graphlets to ordered graphs, we introduce GR-Align, a CMO heuristic that is suited for database searches. On the Proteus_300 set (44 850 protein domain pairs), GR-Align is several orders of magnitude faster than the state-of-the-art CMO solvers Apurva, MSVNS and AlEigen7, and its similarity score is in better agreement with the structural classification of proteins. In a large-scale experiment on the Gold-standard benchmark dataset (3 207 270 protein domain pairs), GR-Align is several orders of magnitude faster than the state-of-the-art protein structure comparison tools TM-Align, DaliLite, MATT and Yakusa, while achieving similar classification performance. Finally, we illustrate the difference between GR-Align's flexible alignments and traditional ones by querying a flexible protein against the Astral-40 database (11 154 protein domains). In this experiment, GR-Align's top-scoring alignments are not only in better agreement with the structural classification of proteins, but also allow more information to be transferred across proteins.
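To make the CMO objective concrete, the sketch below counts the contacts shared by two proteins under a given order-preserving residue alignment. It is an illustration of the quantity that CMO methods maximize, not GR-Align's algorithm: the function names `contact_map` and `contact_overlap`, the toy coordinates, and the 7.5 Å distance threshold are assumptions introduced here and do not come from the abstract.

```python
from itertools import combinations


def contact_map(coords, threshold=7.5):
    """Return the set of residue pairs (i, j), i < j, closer than `threshold`.

    `coords` is a list of (x, y, z) tuples, one per residue. The threshold
    value is an illustrative assumption, not a value stated in the abstract.
    """
    contacts = set()
    for i, j in combinations(range(len(coords)), 2):
        dist_sq = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
        if dist_sq <= threshold ** 2:
            contacts.add((i, j))
    return contacts


def contact_overlap(contacts_a, contacts_b, alignment):
    """Count contacts of protein A that map onto contacts of protein B.

    `alignment` is a list of (residue_in_a, residue_in_b) pairs that must be
    strictly increasing in both coordinates (order-preserving).
    """
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if not (i1 < i2 and j1 < j2):
            raise ValueError("alignment must be order-preserving")
    mapping = dict(alignment)
    shared = 0
    for i, k in contacts_a:
        # Order preservation guarantees mapping[i] < mapping[k] when i < k,
        # so the mapped pair is already in canonical (low, high) order.
        if i in mapping and k in mapping:
            if (mapping[i], mapping[k]) in contacts_b:
                shared += 1
    return shared


# Toy example: residues placed along a line (hypothetical coordinates).
protein_a = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (20, 0, 0)]
protein_b = [(0, 0, 0), (3, 0, 0), (30, 0, 0), (33, 0, 0)]
map_a = contact_map(protein_a)
map_b = contact_map(protein_b)
identity = [(0, 0), (1, 1), (2, 2), (3, 3)]
overlap = contact_overlap(map_a, map_b, identity)
```

A CMO solver searches over all order-preserving alignments for the one that maximizes this count; the sketch only scores a single, fixed alignment.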
