Procleave: Predicting Protease-specific Substrate Cleavage Sites by Combining Sequence and Structural Information

Proteases are enzymes that hydrolyse the peptide bond between two specific amino acid residues of a target substrate protein. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for many physiological processes. Identifying the substrates of a protease is therefore important for understanding its function and physiological role, as well as for therapeutic target identification and pharmaceutical applications. Consequently, there is great demand for bioinformatics methods that can accurately predict novel substrate cleavage events using both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and their cleavage sites that takes into account both sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which proved critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. We demonstrate the performance of Procleave through extensive benchmarking and independent tests, and show in a case study that it correctly identifies most cleavage sites. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates of different proteases, together with their corresponding cleavage sites. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.
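The abstract states that structural features were smoothed with LOWESS and then represented as discrete values, but it does not give the procedure. The following is a minimal sketch of that general idea only, not the Procleave implementation: a plain locally weighted linear regression (tricube kernel, no robustness iterations) followed by equal-width binning. The function names `lowess` and `discretize`, the parameters `frac` and `n_bins`, and the toy data are all assumptions introduced for illustration.

```python
def lowess(xs, ys, frac=0.5):
    """Smooth ys over xs with locally weighted linear regression.

    Illustrative only: tricube weights, no robustness iterations.
    `frac` is the (assumed) fraction of points used in each local fit.
    """
    n = len(xs)
    k = max(2, int(round(frac * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        # Tricube weights; points at or beyond the bandwidth get weight 0.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)  # always >= 1 because x0 itself has weight 1
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def discretize(values, n_bins=3):
    """Map smoothed values to integer labels using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy data (made up): a structural parameter (x) against a noisy
# per-bin cleavage propensity (y).
x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
y = [0.05, 0.12, 0.08, 0.20, 0.31, 0.28, 0.45, 0.52, 0.49, 0.60]

smoothed = lowess(x, y, frac=0.5)
labels = discretize(smoothed, n_bins=3)
print(labels)
```

Discrete labels of this kind are the sort of categorical input a CRF can consume alongside sequence features; how Procleave chooses the smoothing span and the discretization is not specified in the abstract.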
