SOLart: A Structure-Based Method to Predict Protein Solubility and Aggregation

Motivation: The solubility of a protein is often decisive for its proper functioning. Lack of solubility is a major bottleneck in high-throughput structural genomics studies and in high-concentration protein production, and the formation of protein aggregates causes a wide variety of diseases. Since solubility measurements are time-consuming and expensive, there is a strong need for solubility prediction tools.

Results: We have recently introduced solubility-dependent distance potentials that unravel the role of residue-residue interactions in promoting or decreasing protein solubility. Here, we extend their construction by defining solubility-dependent potentials based on backbone torsion angles and solvent accessibility. We integrated these potentials, together with other structure- and sequence-based features, into a random forest model trained on a set of E. coli proteins with experimental structures and solubility values. The result is the SOLart protein solubility predictor, whose most informative features turned out to be folding free energy differences computed from our solubility-dependent statistical potentials. SOLart performs well: the Pearson correlation coefficient between experimental and predicted solubility values is 0.7, both in the training dataset and on an independent set of S. cerevisiae proteins. On test sets of modeled structures, only a limited drop in performance is observed. SOLart can thus be used with both high-resolution and low-resolution structures, and clearly outperforms state-of-the-art solubility predictors. It is available through a webserver designed to be easy to use by non-expert scientists.

Availability: The SOLart webserver is freely available at babylone.ulb.ac.be/SOLART/
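The performance figure quoted above is a Pearson correlation coefficient between experimental and predicted solubility values. As a minimal sketch of how such a score is computed, the following uses only the Python standard library. The function name `pearson_r` and the numbers in `experimental` and `predicted` are hypothetical, chosen for illustration; they are not data or code from the SOLart study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical solubility values (illustration only, not from the paper)
experimental = [10.0, 35.0, 50.0, 80.0, 95.0]
predicted = [20.0, 30.0, 55.0, 70.0, 90.0]

r = pearson_r(experimental, predicted)
print(f"Pearson r = {r:.2f}")
```

A value near 1 indicates that predictions rise and fall with the measurements; the coefficient does not by itself show whether the predicted values are on the same absolute scale as the experimental ones.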
