Discrimination between distant homologs and structural analogs: lessons from manually constructed, reliable data sets.

A natural way to study protein sequence, structure, and function is to put them in the context of evolution. Homologs inherit similarities from their common ancestor, whereas analogs converge on similar structures because there are only a limited number of energetically favorable ways to pack secondary structural elements. Using novel strategies, we previously assembled two reliable databases, one of homologs and one of analogs. In this study, we compare these two data sets and develop a support vector machine (SVM)-based classifier to discriminate between homologs and analogs. The classifier uses a number of well-known similarity scores. We observe that although both structure scores and sequence scores contribute to SVM performance, profile-based sequence scores computed from structural alignments are the best discriminators between remote homologs and structural analogs. We apply our classifier to a representative set from the expert-constructed database Structural Classification of Proteins (SCOP). The SVM classifier recovers 76% of the remote homologs, defined as domains in the same SCOP superfamily but from different families. More importantly, we also detect and discuss interesting homologous relationships between SCOP domains from different superfamilies, folds, and even classes.
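
To make the classification setup concrete, the following is a minimal sketch of separating homolog pairs from analog pairs by their similarity scores. It rests on several assumptions that are not taken from the study: the two features (`structure_score`, `profile_score`), their distributions, and all numeric settings are invented for illustration, and a plain linear SVM trained by stochastic sub-gradient descent stands in for whatever SVM formulation and kernel were actually used.

```python
import random

rng = random.Random(0)  # fixed seed so the synthetic data is reproducible


def make_pairs(n, structure_mean, profile_mean, label):
    """Draw n synthetic domain pairs as ([structure_score, profile_score], label)."""
    return [([rng.gauss(structure_mean, 0.05), rng.gauss(profile_mean, 0.05)], label)
            for _ in range(n)]


# Hypothetical score distributions (assumption): +1 = homolog pair, -1 = analog pair.
# Structure scores overlap heavily; profile scores are well separated.
data = make_pairs(100, 0.60, 0.80, +1) + make_pairs(100, 0.50, 0.20, -1)
X = [features for features, _ in data]
y = [label for _, label in data]


def decision(w, b, x):
    """Signed score of a pair relative to the separating hyperplane: w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=200):
    """Minimize L2-regularized hinge loss with stochastic sub-gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * decision(w, b, X[i]) < 1  # inside the margin?
            w = [wj * (1 - lr * lam) for wj in w]       # regularization shrinkage
            if violated:                                # hinge-loss sub-gradient step
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 (homolog) or -1 (analog) for one feature vector."""
    return 1 if decision(w, b, x) >= 0 else -1


w, b = train_linear_svm(X, y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(X)
print(f"weights={w}, bias={b:.2f}, training accuracy={accuracy:.2f}")
```

Accuracy here is measured on the training data only, so it says nothing about generalization; a real evaluation would hold out pairs or cross-validate. If scikit-learn is available, `sklearn.svm.SVC` could replace the hand-written trainer and would also allow nonlinear kernels.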
