EPSOL: sequence-based protein solubility prediction using multidimensional embedding

MOTIVATION Heterologous expression of recombinant proteins relies on host cells such as Escherichia coli, and protein solubility strongly affects the protein yield. A highly accurate predictor that forecasts protein solubility in an E. coli expression system before any experimental work begins is therefore highly sought, because it could improve production yield while reducing production cost.

RESULTS This paper presents EPSOL, a novel deep learning architecture for predicting protein solubility in an E. coli expression system. EPSOL automatically obtains comprehensive protein feature representations using multidimensional embedding. It outperformed all existing sequence-based solubility predictors, achieving an accuracy of 0.79 and a Matthews correlation coefficient of 0.58. This higher performance permits large-scale screening for sequence variants with enhanced manufacturability, and it allows the solubility of new recombinant proteins in an E. coli expression system to be predicted with greater reliability.

AVAILABILITY AND IMPLEMENTATION EPSOL's best model and results can be downloaded from GitHub (https://github.com/LiangYu-Xidian/EPSOL).

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
