HyperCys: A Structure- and Sequence-Based Predictor of Hyper-Reactive Druggable Cysteines

The cysteine side chain carries a free thiol group, which makes cysteine the amino acid residue most often covalently modified by small molecules bearing weakly electrophilic warheads. Such covalent modification prolongs on-target residence time and reduces the risk of idiosyncratic drug toxicity. However, not all cysteines are equally reactive or accessible. To identify targetable cysteines, we propose HyperCys, a novel stacked ensemble machine learning (ML) model that predicts hyper-reactive druggable cysteines. First, the pocket, conservation, structural and energy profiles, and physicochemical properties of covalently and non-covalently bound cysteines were collected from protein sequences and from the 3D structures of protein–ligand complexes. Then, we built the HyperCys stacked ensemble from six ML models: five base classifiers (K-nearest neighbors, support vector machine, light gradient boosting machine, multi-layer perceptron, and random forest) and a logistic regression meta-classifier. Finally, we compared different feature group combinations by their classification accuracy on hyper-reactive cysteines and by other metrics. After 10-fold cross-validation (CV) with the best window size, HyperCys achieves an accuracy, F1 score, recall, and ROC AUC of 0.784, 0.754, 0.742, and 0.824, respectively. Compared to traditional ML models that use only sequence-based features or only 3D structural features, HyperCys is more accurate at predicting hyper-reactive druggable cysteines. We anticipate that HyperCys will be an effective tool for discovering new potentially reactive cysteines in a wide range of nucleophilic proteins and will contribute to the design of targeted covalent inhibitors with high potency and selectivity.
