Improving enzyme optimum temperature prediction with resampling strategies and ensemble learning

Accurate prediction of the optimal catalytic temperature (Topt) of enzymes is vital in biotechnology, because enzymes with high Topt values are sought for their enhanced reaction rates. A machine-learning method for predicting Topt, TOME, was recently developed. TOME was trained on a normally distributed dataset with a median Topt of 37°C in which fewer than 5% of Topt values exceed 85°C, which limits its predictive power for thermostable enzymes. As a result of this training distribution, the mean squared error on Topt values above 85°C is nearly an order of magnitude higher than the error on values between 30 and 50°C. In this study, we apply ensemble learning and resampling strategies that address the data imbalance, reducing the error on high Topt values (>85°C) by 60% and raising the overall R² value from 0.527 to 0.632. The revised method, TOMER, and the resampling strategies applied in this work are freely available to other researchers as a Python package on GitHub.
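The general idea of combining resampling with an ensemble can be illustrated with a minimal sketch. It is not the TOMER implementation: the function names, the single numeric feature, the straight-line base learner, and the equal split between common and rare samples are all assumptions made for illustration. Only the 85°C threshold comes from the text above. Each ensemble member is fitted to a bootstrap sample in which rare (high Topt) and common values appear equally often, and the members' predictions are averaged.

```python
import random
from statistics import mean


def resample_balanced(data, threshold=85.0, rng=None):
    """Bootstrap a sample in which rare targets (> threshold) and common
    targets (<= threshold) are equally represented.

    `data` is a list of (feature, topt) pairs; a single numeric feature is
    an assumption made to keep the sketch short.
    """
    rng = rng or random.Random(0)
    common = [d for d in data if d[1] <= threshold]
    rare = [d for d in data if d[1] > threshold]
    if not common or not rare:
        return list(data)  # nothing to balance
    n = len(data) // 2
    # Drawing with replacement undersamples the common group and
    # oversamples the rare group.
    return ([rng.choice(common) for _ in range(n)]
            + [rng.choice(rare) for _ in range(n)])


def fit_line(data):
    """Ordinary least-squares line, used as a stand-in base learner."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in data) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_resampled_ensemble(data, n_models=10, threshold=85.0, seed=0):
    """Fit one base learner per balanced bootstrap sample and average them."""
    rng = random.Random(seed)
    models = [fit_line(resample_balanced(data, threshold, rng))
              for _ in range(n_models)]
    return lambda x: mean(m(x) for m in models)


def mse_in_range(model, data, low, high):
    """Mean squared error restricted to targets with low <= topt < high."""
    errors = [(model(x) - y) ** 2 for x, y in data if low <= y < high]
    return mean(errors) if errors else float("nan")
```

Reporting `mse_in_range` separately for the 30 to 50°C range and for values above 85°C mirrors the comparison described above, since an overall error figure can hide poor performance on the rare high-temperature samples.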
