The value of protein structure classification information—Surveying the scientific literature

The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP–extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two-decade-old resources and to guide their future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012–2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non‐SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein or a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We also examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings. Proteins 2015; 83:2025–2038. © 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc.
