Prediction of protein solubility in E. coli

Gene synthesis is a key step in converting digitally predicted proteins into functional proteins. However, it is a relatively expensive and labor-intensive process. About 30-50% of synthesized proteins are not soluble, which further reduces the efficacy of gene synthesis as a method for protein function characterization. Predicting solubility from primary protein sequences therefore promises to substantially reduce the cost of gene synthesis. This work presents a framework that builds computational models of solubility from sequence features derived from the primary protein sequences of the genes to be synthesized. With such models, biologists can focus their effort on synthesizing genes that are highly likely to yield soluble proteins. The framework employs several machine learning algorithms and is used here to predict protein solubility in the Escherichia coli expression system. The analysis is performed on over 1,600 quantified proteins. The approach predicted solubility with more than 80% accuracy and enabled in-depth analysis of the features that most affect solubility. The analysis pipeline is general and can be applied to any set of sequence features to predict any binary measure. The framework also gives the biologist a comprehensive comparison of the different learning algorithms and an insightful feature analysis.
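As a rough illustration of the kind of pipeline described above (sequence features in, binary solubility label out), the following minimal Python sketch may help. It is not the authors' framework: the feature set (amino-acid composition plus hydrophobic and charged fractions), the nearest-centroid classifier, the function names, and the toy sequences are all assumptions chosen for brevity, and the labels are invented rather than taken from the 1,600-protein dataset.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AILMFVWY")   # assumed grouping, for illustration only
CHARGED = set("DEKR")           # assumed grouping, for illustration only


def sequence_features(seq):
    """Map a protein sequence to a fixed-length numeric feature vector.

    Returns 20 amino-acid frequencies followed by the hydrophobic
    fraction and the charged fraction (22 values in total).
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    composition = [counts[aa] / n for aa in AMINO_ACIDS]
    hydrophobic = sum(counts[aa] for aa in HYDROPHOBIC) / n
    charged = sum(counts[aa] for aa in CHARGED) / n
    return composition + [hydrophobic, charged]


def fit_centroids(X, y):
    """Compute the mean feature vector of each class (nearest-centroid model)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def accuracy(centroids, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(predict(centroids, x) == t for x, t in zip(X, y))
    return hits / len(y)


# Invented toy data: 1 = soluble, 0 = insoluble (hypothetical labels).
train = [
    ("DEKRDEKRDE", 1),
    ("KKDDEERRKD", 1),
    ("AILMFVWYAI", 0),
    ("VVLLIIFFWW", 0),
]
X_train = [sequence_features(s) for s, _ in train]
y_train = [label for _, label in train]
model = fit_centroids(X_train, y_train)

print(predict(model, sequence_features("DEKKRRDDEE")))
print(accuracy(model, X_train, y_train))
```

Because the classifier only sees a feature vector and a binary label, the same skeleton could in principle be reused with other sequence features or other binary targets. A real comparison of learning algorithms would use held-out data or cross-validation rather than the training accuracy printed here.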
