BCrystal: an interpretable sequence-based protein crystallization predictor

MOTIVATION: X-ray crystallography has been used to determine the majority of protein structures solved to date. Sequence-based predictors that accurately estimate protein crystallization propensity would help overcome the high cost and large attrition rate of crystallization and reduce the trial-and-error it requires.

MODEL: In this study, we present a novel model, BCrystal, which applies an optimized gradient boosting machine (XGBoost) to sequence, structural, and physico-chemical features extracted from the proteins of interest. BCrystal also explains its output: using the SHAP algorithm, it highlights the features that contribute most to the predicted crystallization propensity of an individual protein.

RESULTS: On three independent test sets, BCrystal outperforms state-of-the-art sequence-based methods by more than 12.5% in accuracy, 18% in recall and 0.253 in Matthews correlation coefficient, with an average accuracy of 93.7%, recall of 96.63% and Matthews correlation coefficient of 0.868. We observed that higher relative solvent accessibility of exposed residues associates positively with protein crystallizability, whereas the number of disordered regions, the fraction of coils and tripeptide stretches that contain multiple histidines associate negatively with it. The higher accuracy of BCrystal enables it to screen reliably for sequence variants with enhanced crystallizability.

AVAILABILITY: Our BCrystal webserver is at: https://machinelearning-protein.qcri.org/ and source code is available at: https://github.com/raghvendra5688/BCrystal.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
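For readers unfamiliar with the three metrics reported above, the following is a minimal sketch of how accuracy, recall and the Matthews correlation coefficient are computed from a binary confusion matrix. The function name `binary_metrics` and the example counts are illustrative assumptions; they are not taken from the BCrystal source code or its test sets.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, recall and Matthews correlation coefficient (MCC)
    from confusion-matrix counts (crystallizable = positive class).

    Hypothetical helper for illustration; not part of BCrystal.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Recall: fraction of truly crystallizable proteins that were recovered.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC ranges from -1 to 1; by convention it is set to 0 when any
    # marginal count is zero and the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "mcc": mcc}


# Made-up counts, chosen only to exercise the formulas.
example = binary_metrics(tp=90, fp=10, tn=85, fn=15)
perfect = binary_metrics(tp=50, fp=0, tn=50, fn=0)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is presumably why it is reported alongside accuracy and recall for data sets where the two classes may be unevenly represented.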
