CATH functional families predict protein functional sites

Motivation: Identification of functional sites in proteins is essential for functional characterisation, variant interpretation and drug design. Several methods are available for predicting either a generic functional site, or specific types of functional site. Here, we present FunSite, a machine learning predictor that identifies catalytic, ligand-binding and protein-protein interaction functional sites using features derived from protein sequence and structure, and evolutionary data from CATH functional families (FunFams).

Results: FunSite’s prediction performance was rigorously benchmarked using cross-validation and a holdout dataset. FunSite outperformed all publicly-available functional site prediction methods. We show that conserved residues in FunFams are enriched in functional sites. We found FunSite’s performance depends greatly on the quality of functional site annotations and the information content of FunFams in the training data. Finally, we analyse which structural and evolutionary features are most predictive for functional sites.

Availability: The datasets and prediction models are available on request.

Contact: c.orengo@ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.
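The claim that conserved residues are enriched in functional sites can be illustrated with a minimal sketch. This is not FunSite's implementation: it assumes a simple Shannon-entropy conservation score per alignment column and an odds ratio with a 0.5 continuity correction, and the function names, the 0.7 threshold, the toy alignment and the functional-site labels are all invented for illustration.

```python
import math
from collections import Counter


def column_conservation(column: str) -> float:
    """Return 1 - normalised Shannon entropy for one alignment column.

    Gaps ('-' and '.') are ignored; entropy is normalised by log2(20),
    the maximum for the 20 standard amino acids. An all-gap column scores 0.
    """
    residues = [c for c in column.upper() if c not in "-."]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / math.log2(20)


def conservation_profile(alignment: list[str]) -> list[float]:
    """Score every column of an alignment given as equal-length strings."""
    return [column_conservation("".join(col)) for col in zip(*alignment)]


def enrichment_odds_ratio(scores, is_functional, threshold=0.7) -> float:
    """Odds ratio of 'conserved' vs 'functional' over alignment columns.

    The threshold is an arbitrary illustrative cut-off. Adding 0.5 to each
    cell of the 2x2 table avoids division by zero on small inputs.
    """
    a = b = c = d = 0
    for score, functional in zip(scores, is_functional):
        conserved = score >= threshold
        if conserved and functional:
            a += 1  # conserved, functional
        elif conserved:
            b += 1  # conserved, not functional
        elif functional:
            c += 1  # variable, functional
        else:
            d += 1  # variable, not functional
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


# Toy data (hypothetical): four aligned sequences, first two columns "functional".
toy_alignment = ["HDKAS", "HDRGT", "HDKLS", "HDQVT"]
toy_labels = [True, True, False, False, False]

profile = conservation_profile(toy_alignment)
odds_ratio = enrichment_odds_ratio(profile, toy_labels)
print([round(s, 3) for s in profile], round(odds_ratio, 2))
```

An odds ratio above 1 on real data would indicate that conserved alignment positions coincide with annotated functional residues more often than variable positions do; a real analysis would also need a significance test and a far larger dataset than this toy example.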
