Prediction of protein solubility in Escherichia coli using logistic regression

In this article we present a new and more accurate model for predicting the solubility of proteins overexpressed in the bacterium Escherichia coli. The model is based on logistic regression. To build it, we considered 32 parameters that could potentially correlate with solubility, and we expanded the protein database relative to those used in previous studies. We tested several implementations of logistic regression, and their results varied. The best implementation, which is the one we report, achieves overall prediction accuracies of 94% for the fitted model and 87% by cross-validation. For comparison, we also tested discriminant analysis using the same parameters and obtained a less accurate prediction (69% cross-validation accuracy for the stepwise forward plus interactions model). Biotechnol. Bioeng. 2010; 105: 374–383. © 2009 Wiley Periodicals, Inc.
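The difference between the two reported accuracies (fitted model versus cross-validation) can be illustrated with a minimal sketch. It is not the implementation used in the article: the two features, the synthetic data, the gradient-descent fit, and all function names (`fit_logistic`, `cross_val_accuracy`, `make_toy_data`) are assumptions made purely for illustration, whereas the article considers 32 parameters computed from real proteins.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=500):
    # Batch gradient descent on the mean logistic (cross-entropy) loss.
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    # Class 1 ("soluble" in this toy setting) if the probability is >= 0.5.
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if p >= 0.5 else 0


def accuracy(w, b, X, y):
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)


def cross_val_accuracy(X, y, k=5):
    # k-fold cross-validation: each sample is predicted by a model
    # that was fitted without it.
    n, correct = len(X), 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        X_train = [X[i] for i in range(n) if i not in test_idx]
        y_train = [y[i] for i in range(n) if i not in test_idx]
        w, b = fit_logistic(X_train, y_train)
        correct += sum(predict(w, b, X[i]) == y[i] for i in test_idx)
    return correct / n


def make_toy_data(n=200, seed=0):
    # Hypothetical data: two made-up features and noisy binary labels.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        X.append([x1, x2])
        y.append(1 if rng.random() < sigmoid(3.0 * x1 - 2.0 * x2) else 0)
    return X, y


X, y = make_toy_data()
w, b = fit_logistic(X, y)
train_acc = accuracy(w, b, X, y)      # accuracy "for the model"
cv_acc = cross_val_accuracy(X, y)     # accuracy "by cross-validation"
print(f"model accuracy: {train_acc:.2f}, cross-validation accuracy: {cv_acc:.2f}")
```

The model accuracy is measured on the same data used for fitting, so it tends to be optimistic. The cross-validation figure scores each sample with a model that never saw it, which is why it is usually the lower and more conservative of the two.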
