Machine learning techniques for protein function prediction

Proteins play important roles in living organisms, and their function is directly linked to their structure. Because the gap between the number of proteins being discovered and the number that have been functionally characterized keeps growing (in particular as a result of experimental limitations), reliable computational prediction of protein function has become crucial. This paper reviews the machine learning techniques used in the literature, following their evolution from simple algorithms such as logistic regression to more advanced methods such as support vector machines and modern deep neural networks. Hyperparameter optimization methods adopted to boost prediction performance are also presented. In parallel, the paper reviews the evolution of the features used by these algorithms, from classical physicochemical properties and amino acid composition to text-derived features from the biomedical literature and feature representations learned with autoencoders, together with feature selection and dimensionality reduction techniques. Finally, success stories in applying these techniques to both general and specific protein function prediction are discussed.
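As a concrete illustration of the simplest feature type mentioned above, amino acid composition can be computed as the relative frequency of each of the 20 standard residues in a sequence. The sketch below is only a minimal example under stated assumptions: the function name `amino_acid_composition`, the alphabetical residue ordering, and the choice to ignore non-standard symbols are illustrative and are not taken from any method covered in the review.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of relative residue frequencies.

    Symbols outside the 20 standard residues (e.g. 'X' or gaps) are
    ignored; this is an illustrative choice, not a fixed convention.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No standard residues found: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the call.
    features = amino_acid_composition("MKTAYIAKQR")
    print(len(features))
```

A fixed-length vector of this kind could then be passed to any of the classifiers named in the abstract, such as logistic regression or a support vector machine.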
