Structure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map

Motivation: Protein solubility is an important consideration in producing new soluble proteins that can reduce the cost of biocatalysts or therapeutic agents. A computational model that accurately predicts protein solubility from the amino acid sequence is therefore highly desirable. Many methods have been developed, but most rely on one-dimensional embeddings of amino acids, which are limited in their ability to capture spatial structural information.

Results: In this study, we developed a new structure-aware method, GraphSol, that predicts protein solubility with an attentive graph convolutional network (GCN), where the protein topology attribute graph is constructed from contact maps predicted from the sequence. GraphSol substantially outperformed other sequence-based methods. The model was stable, with a consistent R² of 0.48 in both the cross-validation and the independent test on the eSOL dataset. To the best of our knowledge, this is the first study to use a GCN for sequence-based predictions. More importantly, this architecture could be extended to other protein prediction tasks.

Availability: The package is available at http://biomed.nscc-gz.cn.

Contact: yangyd25@mail.sysu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.
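The pipeline outlined in the abstract (predicted contact map, then graph, then graph convolution, then attention pooling, then a solubility score) can be illustrated with a minimal NumPy sketch. This is not the GraphSol implementation. The contact threshold of 0.5, the two-layer depth, the single-vector attention, the sigmoid output, the feature sizes, and all function and parameter names are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def normalize_adjacency(contact_prob, threshold=0.5):
    """Binarize a predicted contact map and apply the symmetric
    normalization D^-1/2 (A + I) D^-1/2 used in standard GCNs.
    The threshold value is an assumption, not taken from the paper."""
    adj = (contact_prob >= threshold).astype(float)
    adj = np.maximum(adj, adj.T)      # enforce an undirected graph
    np.fill_diagonal(adj, 1.0)        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_hat, h, w):
    """One graph convolution: aggregate neighbor features, project, ReLU."""
    return np.maximum(a_hat @ h @ w, 0.0)

def attention_pool(h, w_att):
    """Collapse per-residue embeddings of any length into one vector
    using softmax attention weights over residues."""
    scores = h @ w_att
    scores = scores - scores.max()    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ h, weights

def predict_solubility(contact_prob, node_features, params):
    """Hypothetical forward pass returning a score in (0, 1)."""
    a_hat = normalize_adjacency(contact_prob)
    h = gcn_layer(a_hat, node_features, params["w1"])
    h = gcn_layer(a_hat, h, params["w2"])
    pooled, weights = attention_pool(h, params["w_att"])
    logit = pooled @ params["w_out"]
    return float(1.0 / (1.0 + np.exp(-logit))), weights

# Toy input: a 12-residue protein with 20 features per residue.
rng = np.random.default_rng(0)
n_res, n_feat, n_hidden = 12, 20, 8
contact = rng.random((n_res, n_res))      # stand-in for a predicted contact map
features = rng.random((n_res, n_feat))    # stand-in for per-residue features
params = {
    "w1": 0.1 * rng.standard_normal((n_feat, n_hidden)),
    "w2": 0.1 * rng.standard_normal((n_hidden, n_hidden)),
    "w_att": 0.1 * rng.standard_normal(n_hidden),
    "w_out": 0.1 * rng.standard_normal(n_hidden),
}
a_hat = normalize_adjacency(contact)
score, weights = predict_solubility(contact, features, params)
print(f"untrained solubility score: {score:.3f}")
```

Because attention pooling sums over residues with weights that add up to one, the same parameters can be applied to proteins of any length, which is one plausible reason to pair a GCN with an attention readout for sequence-level regression.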
