ProGAN: Protein solubility generative adversarial nets for data augmentation in DNN framework

Abstract Protein solubility plays a critical role in improving the production yield of recombinant proteins in biocatalysis applications. To some extent, protein solubility also reflects the function and activity of biocatalysts, which are mainly composed of recombinant proteins. Many machine learning models in the literature predict protein solubility from protein sequence, but the parameters of those models were underdetermined because too little protein solubility data was available. Here we propose a deep neural network (DNN) as a more accurate regression predictive model. Moreover, to tackle the insufficient data problem, we propose a novel data augmentation algorithm, Protein Solubility Generative Adversarial Nets (ProGAN), to improve the prediction of protein solubility. After mimic data generated by ProGAN were added to the training set, the prediction performance measured by R² improved compared with training without data augmentation. An R² value of 0.4504 was achieved, about 10% higher than that of the previous study using the same dataset.
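
The abstract reports performance as R² but does not spell out the formula. As a minimal sketch, assuming the standard coefficient of determination (1 minus the ratio of the residual sum of squares to the total sum of squares) is what is meant, it could be computed as follows. The function name `r_squared` and the sample values are hypothetical and are not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical solubility values, for illustration only.
measured = [1.0, 2.0, 3.0]
perfect = [1.0, 2.0, 3.0]
mean_only = [2.0, 2.0, 2.0]
print(r_squared(measured, perfect))
print(r_squared(measured, mean_only))
```

Under this definition, a model that only predicts the mean of the measured values scores 0 and a perfect model scores 1, so the reported 0.4504 lies between those two reference points.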