Biocuration in the structure–function linkage database: the anatomy of a superfamily

With ever-increasing amounts of sequence data available in both the primary literature and sequence repositories, assigning molecular function to sequences has become a bottleneck. This article describes the biocuration process and methods used in the Structure–Function Linkage Database (SFLD) to help address some of these challenges. We discuss how the hierarchy within the SFLD allows us to infer detailed functional properties for functionally diverse enzyme superfamilies in which all members are homologous, conserve an aspect of their chemical function and have associated conserved structural features that enable the chemistry. Also presented is the Enzyme Structure–Function Ontology (ESFO), which has been designed to capture the relationships between enzyme sequence, structure and function that underlie the SFLD and is used to guide the biocuration processes within the SFLD. Database URL: http://sfld.rbvi.ucsf.edu/
