A Survey for Predicting Enzyme Family Classes Using Machine Learning Methods.

Enzymes are proteins that act as biological catalysts, speeding up biochemical processes in the cell. According to the first digit of their Enzyme Commission (EC) number, enzymes are divided into six main classes: EC-1, oxidoreductases; EC-2, transferases; EC-3, hydrolases; EC-4, lyases; EC-5, isomerases; and EC-6, ligases (synthetases). Different enzymes have different biological functions and act on different substrates. Knowing which family an enzyme belongs to can therefore help infer its catalytic mechanism and provide information about its biological function. With large numbers of protein sequences flowing into databanks in the post-genomic age, annotating the family of each enzyme has become very important. Because experimental methods are costly, bioinformatics tools can be a great help in accurately classifying enzymes into families. In this review, we summarize the application of machine learning methods to enzyme family prediction from different aspects. We hope that this review will provide insights and inspiration for research on enzyme family classification.
