PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework.

Determining the catalytic residues in an enzyme is critical to understanding the relationship between protein sequence, structure, and function, and to enhancing our ability to design novel enzymes and their inhibitors. Although many enzymes have been sequenced and their primary and tertiary structures determined, experimental methods for enzyme functional characterization lag behind. Because experimental identification of catalytic residues is resource- and labor-intensive, computational approaches are highly desirable: they can complement experimental studies and help bridge the sequence-structure-function gap. In this study, we describe a new computational method, PREvaIL, for predicting enzyme catalytic residues. The method combines a comprehensive set of informative features extracted at three levels (sequence, structure, and residue-contact network) in a random forest machine-learning framework. Extensive benchmarking on eight different datasets using 10-fold cross-validation and independent tests, together with side-by-side comparisons against seven modern sequence- and structure-based methods, showed that PREvaIL achieved competitive predictive performance, with the area under the receiver operating characteristic curve ranging from 0.896 to 0.973 and the area under the precision-recall curve ranging from 0.294 to 0.523. We demonstrated that the method captures useful signals from the different feature levels, and that combining these distinct but complementary feature types significantly improves the performance of catalytic residue prediction. We believe that this new method can serve as a valuable tool both for understanding the complex sequence-structure-function relationships of proteins and for facilitating the characterization of novel enzymes lacking functional annotations.
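
The two metrics reported above can behave very differently when positives are rare, as catalytic residues are within a protein. The following is a minimal sketch, not PREvaIL's evaluation code, of how the area under the ROC curve and a step-wise area under the precision-recall curve (average precision) can be computed from per-residue scores. The function names and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs in which the positive is scored higher (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve: the mean of the
    precision values at the rank of each positive. Tied scores are not
    treated specially in this sketch."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy protein: 10 residues, 2 of them catalytic (label 1).
labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.10, 0.40, 0.90, 0.20, 0.95, 0.30, 0.05, 0.60, 0.15, 0.25]

print("AUROC:", auroc(labels, scores))
print("Average precision:", average_precision(labels, scores))
```

In this toy case a single high-scoring false positive leaves the ROC area fairly high but costs the precision-recall area much more, which is one reason the two ranges quoted in the abstract can differ so widely on imbalanced data.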
