A Relational Extension of the Notion of Motifs: Application to the Common 3D Protein Substructures Searching Problem

The geometrical configurations of atoms in protein structures can be viewed as approximate relations among them. Finding similar common substructures within a set of protein structures then belongs to a new class of problems that generalizes the problem of finding repeated motifs. The novelty lies in the addition of constraints on the motifs, expressed as relations that must hold between pairs of motif positions. We hence call them relational motifs. For this class of problems, we present an algorithm that is a suitable extension of the KMR paradigm and, in particular, of KMRC, as it uses a degenerate alphabet. Our algorithm contains several improvements that become especially useful when, as is required for relational motifs, the inference is made by partially overlapping shorter motifs rather than by concatenating them. The efficiency, correctness and completeness of the algorithm are ensured by several non-trivial properties that are proven in this paper. The algorithm has been applied to the search for common 3D substructures in proteins. The implemented methods have been tested on several protein families, such as serine proteases, globins and cytochromes P450. The detected motifs have been compared to those found by multiple structural alignment methods.
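To make the idea of inference by overlap concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes only two structures, each given as a list of 3D points, and it assumes that the relation between two positions is their Euclidean distance, compared within a tolerance `eps`. The function names `seed_pairs`, `extend` and `brute_force` are hypothetical. The sketch relies on one observation: two windows of length k that overlap with a shift of one position cover every pair of positions of the window of length k + 1 except the pair (first, last), so only that single relation needs to be checked when extending.

```python
from itertools import combinations
from math import dist


def seed_pairs(a, b, eps):
    """Length-2 matches: start positions (i, j) such that the distance
    a[i]..a[i+1] agrees with the distance b[j]..b[j+1] within eps."""
    return {
        (i, j)
        for i in range(len(a) - 1)
        for j in range(len(b) - 1)
        if abs(dist(a[i], a[i + 1]) - dist(b[j], b[j + 1])) <= eps
    }


def extend(pairs, a, b, k, eps):
    """Infer matches of length k + 1 from matches of length k.

    (i, j) is kept only if the match shifted by one, (i + 1, j + 1),
    also exists and the one relation not covered by the two overlapping
    windows (first position vs. last position) agrees as well.
    """
    out = set()
    for (i, j) in pairs:
        if (i + 1, j + 1) not in pairs:
            continue
        # (i + 1, j + 1) being a length-k match guarantees that
        # a[i + k] and b[j + k] are valid indices.
        if abs(dist(a[i], a[i + k]) - dist(b[j], b[j + k])) <= eps:
            out.add((i, j))
    return out


def brute_force(a, b, k, eps):
    """Reference check: compare all pairwise relations of two windows."""
    return {
        (i, j)
        for i in range(len(a) - k + 1)
        for j in range(len(b) - k + 1)
        if all(
            abs(dist(a[i + p], a[i + q]) - dist(b[j + p], b[j + q])) <= eps
            for p, q in combinations(range(k), 2)
        )
    }


if __name__ == "__main__":
    # Toy data (assumption): b is a translated copy of a with one extra point.
    a = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 1), (2, 2, 2)]
    b = [(9, 9, 9)] + [(x + 5, y + 5, z + 5) for x, y, z in a]
    m = seed_pairs(a, b, 0.1)
    for k in (2, 3):
        m = extend(m, a, b, k, 0.1)
    print(sorted(m))  # start positions of common length-4 substructures
```

The sketch compares windows directly, pair against pair, and therefore sidesteps the issues raised by a degenerate alphabet and by more than two structures, which are the cases the algorithm described above is designed to handle.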
