CATH functional families predict functional sites in proteins

MOTIVATION Identification of functional sites in proteins is essential for functional characterization, variant interpretation and drug design. Several methods are available for predicting either a generic functional site or specific types of functional site. Here, we present FunSite, a machine learning predictor that identifies catalytic, ligand-binding and protein-protein interaction functional sites using features derived from protein sequence and structure, and evolutionary data from CATH functional families (FunFams).

RESULTS FunSite's prediction performance was rigorously benchmarked using cross-validation and a holdout dataset. FunSite outperformed other publicly available functional site prediction methods. We show that conserved residues in FunFams are enriched in functional sites. We found that FunSite's performance depends greatly on the quality of functional site annotations and the information content of FunFams in the training data. Finally, we analyse which structural and evolutionary features are most predictive for functional sites.

AVAILABILITY https://github.com/UCL/cath-funsite-predictor.

CONTACT c.orengo@ucl.ac.uk.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
