Protein solubility: sequence-based prediction and experimental verification

MOTIVATION: Obtaining soluble proteins in sufficient concentrations is a recurring limiting factor in many experimental studies. Solubility is an individual trait of a protein which, under a given set of experimental conditions, is determined by its amino acid sequence. Accurate theoretical prediction of solubility from sequence is instrumental for prioritizing targets in large-scale proteomics projects.

RESULTS: We present a machine-learning approach called PROSO that assesses the chance that a protein will be soluble upon heterologous expression in Escherichia coli, based on its amino acid composition. The classification algorithm is organized as a two-layered structure in which the output of primary support vector machine (SVM) classifiers serves as input for a secondary Naive Bayes classifier. Experimental progress information from the TargetDB database, as well as previously published datasets, was used as the source of training data. Compared with previously published methods, our classification algorithm has improved discriminatory capacity, characterized by a Matthews correlation coefficient (MCC) of 0.434 between predicted and known solubility states and an overall prediction accuracy of 72% (75% and 68% for the positive and negative class, respectively). We also provide experimental verification of our predictions using solubility measurements for 31 mutational variants of two different proteins.
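
As a reading aid for the reported metrics, the sketch below shows how an MCC and an overall accuracy are computed from a binary confusion matrix. The counts are hypothetical: they assume a balanced test set of 100 soluble and 100 insoluble proteins, with the per-class accuracies of 75% and 68% quoted above. The actual class sizes are not given in this abstract, and the function name is illustrative.

```python
from math import sqrt


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical balanced test set (assumption, not from the paper):
# 100 soluble proteins, 75 predicted correctly (75% positive-class accuracy)
tp, fn = 75, 25
# 100 insoluble proteins, 68 predicted correctly (68% negative-class accuracy)
tn, fp = 68, 32

mcc = matthews_corrcoef(tp, tn, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"MCC = {mcc:.3f}, accuracy = {accuracy:.1%}")
```

Under this balanced-class assumption the MCC comes out at roughly 0.43 and the accuracy at 71.5%, close to but not identical with the reported 0.434 and 72%. The gap is expected, since the real class proportions may differ from the ones assumed here.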
