Learning to predict expression efficacy of vectors in recombinant protein production

Background: Recombinant protein production is a useful biotechnology for producing large quantities of highly soluble proteins. Currently, the most widely used approach is to fuse a target protein into different vectors in Escherichia coli (E. coli). However, the production efficacy of a vector varies from one target protein to another, and trial and error is still the common practice for finding out the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of the expressed proteins. In fact, many pairings of vectors and proteins result in no expression.

Results: In this study, we applied machine learning to train models that predict whether a vector-protein pairing will or will not express in E. coli. For expressed cases, the models further predict whether the expressed protein will be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used in both the training and the evaluation of our models. We evaluate three different models based on support vector machines (SVM) and their ensembles. Unlike many previous works, these models use the sequence of the target protein as well as the sequence of the whole fusion vector as features. We show that a model that classifies a case directly into one of the three classes (no expression, inclusion body, and soluble) outperforms a model that considers the nested structure of the three classes, while a model that takes advantage of the hierarchical structure of the three classes performs slightly worse than, but comparably to, the best model. Our best method also achieves higher prediction accuracy than the methods of previous works. Lastly, we briefly present two methods for using the trained model in the design of recombinant protein production systems to improve the chance of high soluble protein production.

Conclusion: In this paper, we show that a machine learning approach to predicting the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement the traditional knowledge-driven study of efficacy. We will release our program in the public domain to share with other labs when this paper is published.
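The two-stage decision described in the Results (first expression, then solubility for expressed cases) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how sequences are encoded, so the amino-acid composition features, the tag-before-target fusion order, and the names `composition`, `features`, and `nested_predict` are all assumptions made for illustration. The two binary classifiers are passed in as plain callables standing in for trained SVMs.

```python
from collections import Counter
from typing import Callable, Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq: str) -> Dict[str, float]:
    """Fraction of each standard amino acid in seq (assumed encoding)."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def features(vector_seq: str, target_seq: str) -> List[float]:
    """Composition of the target alone, then of the whole fusion.

    Assumes the vector-encoded part precedes the target in the fusion.
    """
    target = composition(target_seq)
    fusion = composition(vector_seq + target_seq)
    return [target[aa] for aa in AMINO_ACIDS] + [fusion[aa] for aa in AMINO_ACIDS]


def nested_predict(
    x: List[float],
    expresses: Callable[[List[float]], bool],
    is_soluble: Callable[[List[float]], bool],
) -> str:
    """Two-stage decision: expression first, solubility only if expressed."""
    if not expresses(x):
        return "no expression"
    return "soluble" if is_soluble(x) else "inclusion body"


if __name__ == "__main__":
    # Toy sequences and toy decision rules, purely hypothetical.
    x = features("MHHHHHH", "MKTAYIAKQR")
    print(len(x))
    print(nested_predict(x, lambda v: True, lambda v: False))
```

A single three-class model would instead map the same feature vector directly to one of the three labels, which is the design the abstract reports as performing best.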
