A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli

Background: Over the last 20 years, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low production levels and frequent failures for opaque reasons. The problem is accentuated because no generic solution is available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. Under given experimental conditions, including temperature, expression host, etc., protein solubility is a feature ultimately determined by its sequence. To date, numerous machine learning methods have been proposed to predict the solubility of a protein solely from its amino acid sequence. Despite 20 years of research on the matter, no comprehensive review of the published methods is available.

Results: This paper presents an extensive review of the existing models for predicting protein solubility in the Escherichia coli recombinant protein overexpression system. The models are investigated and compared with regard to the datasets used, features, feature selection methods, machine learning techniques and accuracy of prediction. A discussion of the models is provided at the end.

Conclusions: This study aims to investigate extensively the machine learning based methods for predicting recombinant protein solubility, so as to offer both a general and a detailed understanding for researchers in the field. Some of the models offer acceptable prediction performance and convenient user interfaces. These models can be considered valuable tools for predicting recombinant protein overexpression results before performing real laboratory experiments, thus saving labour, time and cost.
