Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition

Background: Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers, such as two-stage support vector machine (SVM) based classifiers, and a number of feature types, such as physicochemical properties, amino acid composition, and dipeptide composition, together with feature selection. It is desirable to develop a method for predicting protein solubility that is simpler and easier to interpret than the existing complex SVM-based methods.

Results: This study proposes a novel scoring card method (SCM) that uses dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of the 400 individual dipeptides to be soluble using statistical discrimination between the soluble and insoluble proteins of a training data set. The propensity scores of all dipeptides are then further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and the dipeptide composition. To evaluate SCM by performance comparisons, four data sets of different sizes and with different degrees of variation in experimental conditions were used. The results show that the simple SCM, with its interpretable dipeptide propensities, has promising performance compared with existing SVM-based ensemble methods that use a number of feature types. Furthermore, the propensities of dipeptides and the solubility scores of sequences can provide insights into protein solubility. For example, the analysis of dipeptide scores indicates that α-helix structures and thermophilic proteins have a high propensity to be soluble.

Conclusions: The propensities of individual dipeptides to be soluble vary for proteins under different experimental conditions. For accurate prediction of protein solubility using SCM, it is better to customize the score card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed SCM, with its solubility scores and dipeptide propensities, can be easily applied to protein function prediction problems in which dipeptide composition features play an important role.

Availability: The datasets used, the source code of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/.
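
The scoring idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy sequences, and the min-max normalization of propensities to the range 0 to 1 are assumptions made for illustration, and the genetic-algorithm optimization of the propensity scores is omitted entirely. The sketch only shows how an initial score card could be derived from the difference in mean dipeptide composition between soluble and insoluble training sequences, and how a sequence score could then be computed as the composition-weighted sum of dipeptide propensities.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the fraction of each of the 400 dipeptides in seq."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return {d: (c / total if total else 0.0) for d, c in counts.items()}


def mean_composition(seqs):
    """Average dipeptide composition over a list of sequences."""
    comps = [dipeptide_composition(s) for s in seqs]
    return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}


def initial_propensities(soluble, insoluble):
    """Assumed initial score card: soluble-minus-insoluble mean composition,
    min-max normalized to [0, 1]. The scale is an illustrative choice."""
    pos = mean_composition(soluble)
    neg = mean_composition(insoluble)
    diff = {d: pos[d] - neg[d] for d in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = hi - lo
    return {d: ((v - lo) / span if span else 0.5) for d, v in diff.items()}


def solubility_score(seq, propensities):
    """Weighted sum of dipeptide propensities and dipeptide composition."""
    comp = dipeptide_composition(seq)
    return sum(propensities[d] * comp[d] for d in DIPEPTIDES)


# Toy training data (hypothetical sequences, not taken from the paper).
soluble_train = ["KEKEKEKE", "EKEKEK"]
insoluble_train = ["LVLVLVLV", "VLVLVL"]
card = initial_propensities(soluble_train, insoluble_train)
```

With this toy score card, `solubility_score("KEKEKE", card)` comes out higher than `solubility_score("LVLVLV", card)`. A classifier would then compare the score with a threshold chosen on the training set; how that threshold is chosen is not specified in the abstract and is left out here.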
