Discrimination of soluble and aggregation-prone proteins based on sequence information.

Understanding the factors that govern protein solubility may provide insight into protein aggregation and misfolding-related diseases such as Alzheimer's disease. In this work, we attempt to identify the factors important to protein solubility using feature selection. First, we calculate 1438 features, including physicochemical properties and statistics, for each protein. The Random Forest algorithm is then used to select a minimal subset of the most informative features based on their predictive performance. A predictive model is built on the 17 selected features. Compared with previous models, our model achieves better performance, with a sensitivity of 0.82, specificity of 0.85, accuracy (ACC) of 0.84, area under the ROC curve (AUC) of 0.91 and Matthews correlation coefficient (MCC) of 0.67. Furthermore, a model trained on a redundancy-reduced dataset (sequence identity <= 30%) achieves the same performance as the model trained without redundancy reduction. Our results provide not only a reliable model for predicting protein solubility but also a list of features important to protein solubility. The predictive model is implemented as a freely available web application at .
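For readers less familiar with the reported measures, the following minimal sketch shows how sensitivity, specificity, ACC and MCC follow from the four confusion-matrix counts of a binary classifier (soluble taken as the positive class). The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration; they are not the data of this study. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary classification measures from confusion counts.

    tp: positives predicted positive, fn: positives predicted negative,
    tn: negatives predicted negative, fp: negatives predicted positive.
    """
    sensitivity = _ratio(tp, tp + fn)          # true positive rate
    specificity = _ratio(tn, tn + fp)          # true negative rate
    accuracy = _ratio(tp + tn, tp + fn + tn + fp)
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, denom)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts for a balanced test set of 100 soluble and
    # 100 insoluble proteins (illustrative only, not the paper's data).
    for name, value in binary_metrics(tp=82, fn=18, tn=85, fp=15).items():
        print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is commonly reported alongside ACC when the two classes may differ in size.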
