Protein function prediction with high-throughput data

Protein function prediction is one of the main challenges in post-genomic era. The availability of large amounts of high-throughput data provides an alternative approach to handling this problem from the computational viewpoint. In this review, we provide a comprehensive description of the computational methods that are currently applicable to protein function prediction, especially from the perspective of machine learning. Machine learning techniques can generally be classified as supervised learning, semi-supervised learning and unsupervised learning. By classifying the existing computational methods for protein annotation into these three groups, we are able to present a comprehensive framework on protein annotation based on machine learning techniques. In addition to describing recently developed theoretical methodologies, we also cover representative databases and software tools that are widely utilized in the prediction of protein function.

[1]  Ignacio Marín,et al.  Iterative Cluster Analysis of Protein Interaction Data , 2005, Bioinform..

[2]  F. Tan,et al.  Prediction of mitochondrial proteins based on genetic algorithm – partial least squares and support vector machine , 2007, Amino Acids.

[3]  Z. Huang,et al.  Using complexity measure factor to predict protein subcellular location , 2005, Amino Acids.

[4]  Meng Wang,et al.  Using Fourier Spectrum Analysis and Pseudo Amino Acid Composition for Prediction of Membrane Protein Types , 2005, The protein journal.

[5]  K. Chou,et al.  Recent progress in protein subcellular location prediction. , 2007, Analytical biochemistry.

[6]  M. Wang,et al.  Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. , 2004, Protein engineering, design & selection : PEDS.

[7]  Qianzhong Li,et al.  Using pseudo amino acid composition to predict protein structural class: Approached by incorporating 400 dipeptide components , 2007, J. Comput. Chem..

[8]  Howard Leung,et al.  Prediction of membrane protein types from sequences and position-specific scoring matrices. , 2007, Journal of theoretical biology.

[9]  Ponnuthurai N. Suganthan,et al.  A machine learning approach for the identification of odorant binding proteins from sequence-derived properties , 2007, BMC Bioinformatics.

[10]  Guo-Ping Zhou,et al.  An Intriguing Controversy over Protein Structural Class Prediction , 1998, Journal of protein chemistry.

[11]  Kuo-Chen Chou,et al.  Large-scale predictions of gram-negative bacterial protein subcellular locations. , 2006, Journal of proteome research.

[12]  X.-D. Sun,et al.  Prediction of protein structural classes using support vector machines , 2006, Amino Acids.

[13]  H.-B. Shen,et al.  Euk-PLoc: an ensemble classifier for large-scale eukaryotic protein subcellular location prediction , 2007, Amino Acids.

[14]  F.-M. Li,et al.  Using pseudo amino acid composition to predict protein subnuclear location with improved hybrid approach , 2007, Amino Acids.

[15]  Kuo-Chen Chou,et al.  Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. , 2003, Biochemical and biophysical research communications.

[16]  Haidong Wang,et al.  Discovering molecular pathways from protein interaction and gene expression data , 2003, ISMB.

[17]  Yanzhi Guo,et al.  Predicting DNA-binding proteins: approached from Chou’s pseudo amino acid composition and other specific sequence features , 2007, Amino Acids.

[18]  Ichigaku Takigawa,et al.  Annotating gene function by combining expression data with a modular gene network , 2007, ISMB/ECCB.

[19]  Bernhard Schölkopf,et al.  Fast protein classification with multiple networks , 2005, ECCB/JBI.

[20]  Kuo-Chen Chou,et al.  Prediction of Membrane Protein Types by Incorporating Amphipathic Effects , 2005, J. Chem. Inf. Model..

[21]  L. Mirny,et al.  Protein complexes and functional modules in molecular networks , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[22]  Jason Weston,et al.  Mismatch string kernels for discriminative protein classification , 2004, Bioinform..

[23]  K. Chou,et al.  Gpos-PLoc: an ensemble classifier for predicting subcellular localization of Gram-positive bacterial proteins. , 2007, Protein engineering, design & selection : PEDS.

[24]  H. Mewes,et al.  Functional modules by relating protein interaction networks and gene expression. , 2003, Nucleic acids research.

[25]  S. Fields,et al.  A novel genetic system to detect protein–protein interactions , 1989, Nature.

[26]  Igor Jurisica,et al.  Functional topology in a network of protein interactions , 2004, Bioinform..

[27]  Ying-Li Chen,et al.  Prediction of the subcellular location of apoptosis proteins. , 2007, Journal of theoretical biology.

[28]  Sean R. Collins,et al.  Global landscape of protein complexes in the yeast Saccharomyces cerevisiae , 2006, Nature.

[29]  Zhen-Hui Zhang,et al.  A novel method for apoptosis protein subcellular localization prediction combining encoding based on grouped weight and support vector machine , 2006, FEBS letters.

[30]  Kuo-Chen Chou,et al.  Predicting protein structural class with AdaBoost Learner. , 2006, Protein and peptide letters.

[31]  Alexander Rives,et al.  Modular organization of cellular networks , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[32]  M. Wang,et al.  Low-frequency Fourier spectrum for predicting membrane protein types. , 2005, Biochemical and biophysical research communications.

[33]  Jiang Wang,et al.  Prediction of protein structural class with Rough Sets , 2006, BMC Bioinformatics.

[34]  Roded Sharan,et al.  Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[35]  H. Mewes,et al.  The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. , 2004, Nucleic acids research.

[36]  David Martin,et al.  Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network , 2003, Genome Biology.

[37]  K. Chou,et al.  Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms , 2008, Nature Protocols.

[38]  I-Min A. Dubchak,et al.  A computational approach to identify genes for functional RNAs in genomic sequences. , 2001, Nucleic acids research.

[39]  Robert E. Schapire,et al.  Hierarchical multi-label prediction of gene function , 2006, Bioinform..

[40]  Xing-Ming Zhao,et al.  Classifying protein sequences using hydropathy blocks , 2006, Pattern Recognit..

[41]  Kuo-Chen Chou,et al.  Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types. , 2005, Biochemical and biophysical research communications.

[42]  Loris Nanni,et al.  Genetic programming for creating Chou’s pseudo amino acid based features for submitochondria localization , 2008, Amino Acids.

[43]  Peixiang Cai,et al.  Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network. , 2006, Analytical biochemistry.

[44]  Luonan Chen,et al.  Discovering functions and revealing mechanisms at molecular level from biological networks , 2007, Proteomics.

[45]  Sung-Hou Kim,et al.  Global mapping of the protein structure space and application in structure-based inference of protein function. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[46]  Gultekin Özsoyoglu,et al.  Annotating proteins by mining protein interaction networks , 2006, ISMB.

[47]  Olga G. Troyanskaya,et al.  BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm332 Data and text mining , 2022 .

[48]  Frank Dudbridge,et al.  The Use of Edge-Betweenness Clustering to Investigate Biological Function in Protein Interaction Networks , 2005, BMC Bioinformatics.

[49]  Ying-Li Chen,et al.  Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo-amino acid composition. , 2007, Journal of theoretical biology.

[50]  H.-B. Shen,et al.  Using ensemble classifier to identify membrane protein types , 2006, Amino Acids.

[51]  Guo-Ping Zhou,et al.  Subcellular location prediction of apoptosis proteins , 2002, Proteins.

[52]  K. Chou,et al.  Predicting protein-protein interactions from sequences in a hybridization space. , 2006, Journal of proteome research.

[53]  K. Chou,et al.  Prediction of linear B-cell epitopes using amino acid pair antigenicity scale , 2007, Amino Acids.

[54]  Mona Singh,et al.  Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps , 2005, ISMB.

[55]  Li Liao,et al.  Combining pairwise sequence similarity and support vector machines for remote protein homology detection , 2002, RECOMB '02.

[56]  Z. Huang,et al.  Using pseudo amino acid composition to predict protein subcellular location: Approached with Lyapunov index, Bessel function, and Chebyshev filter , 2005, Amino Acids.

[57]  S.-W. Zhang,et al.  Prediction of protein subcellular localization by support vector machines using multi-scale energy and pseudo amino acid composition , 2007, Amino Acids.

[58]  M. Samanta,et al.  Predicting protein functions from redundancies in large-scale protein interaction networks , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[59]  Ying Huang,et al.  Prediction of protein subcellular locations using fuzzy k-NN method , 2004, Bioinform..

[60]  Kuo-Chen Chou,et al.  Large‐scale plant protein subcellular location prediction , 2007, Journal of cellular biochemistry.

[61]  Xing-Ming Zhao,et al.  Gene Function Prediction with the Shortest Path in Functional Linkage Graph , 2007 .

[62]  Zhanchao Li,et al.  Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. , 2007, Journal of theoretical biology.

[63]  David Haussler,et al.  A Discriminative Framework for Detecting Remote Protein Homologies , 2000, J. Comput. Biol..

[64]  Da-Peng Li,et al.  Amino Acid Principal Component Analysis (AAPCA) and its Applications in Protein Structural Class Prediction , 2006, Journal of biomolecular structure & dynamics.

[65]  Haifeng Li,et al.  Systematic discovery of functional modules and context-specific functional annotation of human genome , 2007, ISMB/ECCB.

[66]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[67]  Bhaskar D. Kulkarni,et al.  Using pseudo amino acid composition to predict protein subnuclear localization: Approached with PSSM , 2007, Pattern Recognit. Lett..

[68]  S. Brunak,et al.  Locating proteins in the cell using TargetP, SignalP and related tools , 2007, Nature Protocols.

[69]  Kuo-Chen Chou,et al.  Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes , 2005, Bioinform..

[70]  S. Kasif,et al.  Whole-genome annotation by using evidence integration in functional-linkage networks. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[71]  Michael I. Jordan,et al.  Loopy Belief Propagation for Approximate Inference: An Empirical Study , 1999, UAI.

[72]  K. Chou,et al.  Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. , 2007, Journal of proteome research.

[73]  Limsoon Wong,et al.  An efficient strategy for extensive integration of diverse biological data for protein function prediction , 2007, Bioinform..

[74]  M. Gerstein,et al.  Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons. , 2002, Genome research.

[75]  Q. Pan,et al.  Using pseudo amino acid composition to predict protein subcellular location: approached with amino acid composition distribution , 2008, Amino Acids.

[76]  Daniel Hanisch,et al.  Co-clustering of biological networks and gene expression data , 2002, ISMB.

[77]  Kuo-Chen Chou,et al.  MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. , 2007, Biochemical and biophysical research communications.

[78]  K. Chou,et al.  EzyPred: a top-down approach for predicting enzyme functional classes and subclasses. , 2007, Biochemical and biophysical research communications.

[79]  D.-S. Huang,et al.  Radial Basis Probabilistic Neural Networks: Model and Application , 1999, Int. J. Pattern Recognit. Artif. Intell..

[80]  Jacques van Helden,et al.  Evaluation of clustering algorithms for protein-protein interaction networks , 2006, BMC Bioinformatics.

[81]  H.-B. Shen,et al.  Predicting secretory protein signal sequence cleavage sites by fusing the marks of global alignments , 2006, Amino Acids.

[82]  Kuo-Chen Chou,et al.  Predicting enzyme subclass by functional domain composition and pseudo amino acid composition. , 2005, Journal of proteome research.

[83]  Kuo-Chen Chou,et al.  Predicting enzyme family classes by hybridizing gene product composition and pseudo-amino acid composition. , 2005, Journal of theoretical biology.

[84]  Xiaoyong Zou,et al.  Using pseudo-amino acid composition and support vector machine to predict protein structural class. , 2006, Journal of theoretical biology.

[85]  Nello Cristianini,et al.  Kernel-Based Data Fusion and Its Application to Protein Function Prediction in Yeast , 2003, Pacific Symposium on Biocomputing.

[86]  Limsoon Wong,et al.  Exploiting indirect neighbours and topological weight to predict protein function from protein--protein interactions , 2006 .

[87]  D Haussler,et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[88]  K. Chou,et al.  Virus-PLoc: a fusion classifier for predicting the subcellular localization of viral proteins within host and virus-infected cells. , 2007, Biopolymers.

[89]  Yanda Li,et al.  Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence , 2006, BMC Bioinformatics.

[90]  Kuo-Chen Chou,et al.  Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition. , 2005, Biochemical and biophysical research communications.

[91]  Kuo-Bin Li,et al.  AAIndexLoc: predicting subcellular localization of proteins based on a new representation of sequences using amino acid indices , 2008, Amino Acids.

[92]  Ron Shamir,et al.  A clustering algorithm based on graph connectivity , 2000, Inf. Process. Lett..

[93]  G P Zhou,et al.  Some insights into protein structural class prediction , 2001, Proteins.

[94]  Vipin Kumar,et al.  Association analysis-based transformations for protein interaction networks: a function prediction case study , 2007, KDD '07.

[95]  Loris Nanni,et al.  Combing ontologies and dipeptide composition for predicting DNA-binding proteins , 2007, Amino Acids.

[96]  Z. Wen,et al.  Delaunay triangulation with partial least squares projection to latent structures: a model for G-protein coupled receptors classification and fast structure recognition , 2007, Amino Acids.

[97]  Chun Yan,et al.  Prediction of protein subcellular location using a combined feature of sequence , 2005, FEBS letters.

[98]  T. M. Murali,et al.  VIRGO: computational prediction of gene functions , 2006, Nucleic Acids Res..

[99]  Gary D. Bader,et al.  An automated method for finding molecular complexes in large protein interaction networks , 2003, BMC Bioinformatics.

[100]  Alessandro Vespignani,et al.  Global protein function prediction from protein-protein interaction networks , 2003, Nature Biotechnology.

[101]  P. Bork,et al.  Functional organization of the yeast proteome by systematic analysis of protein complexes , 2002, Nature.

[102]  Ting Chen,et al.  Diffusion kernel-based logistic regression models for protein function prediction. , 2006, Omics : a journal of integrative biology.

[103]  Hao Lin,et al.  Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant. , 2007, Biochemical and biophysical research communications.

[104]  Anton J. Enright,et al.  Detection of functional modules from protein interaction networks , 2003, Proteins.

[105]  K. Chou,et al.  Using LogitBoost classifier to predict protein structural classes. , 2006, Journal of theoretical biology.

[106]  Sukanta Mondal,et al.  Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification. , 2006, Journal of theoretical biology.

[107]  X.-B. Zhou,et al.  Improved prediction of subcellular location for apoptosis proteins by the dual-layer support vector machine , 2008, Amino Acids.

[108]  Gary D Bader,et al.  Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry , 2002, Nature.

[109]  Xing-Ming Zhao,et al.  Gene function prediction using labeled and unlabeled data , 2008, BMC Bioinformatics.

[110]  Kuo-Chen Chou,et al.  Ensemble classifier for protein fold pattern recognition , 2006, Bioinform..

[111]  Igor Jurisica,et al.  Protein complex prediction via cost-based clustering , 2004, Bioinform..

[112]  Stanley Letovsky,et al.  Predicting protein function from protein/protein interaction data: a probabilistic approach , 2003, ISMB.

[113]  Jinze Liu,et al.  Structure‐based function inference using protein family‐specific fingerprints , 2006, Protein science : a publication of the Protein Society.

[114]  Zhi-Ping Liu,et al.  Predicting gene ontology functions from protein's regional surface structures , 2007, BMC Bioinformatics.

[115]  Tongliang Zhang,et al.  Using pseudo amino acid composition and binary-tree support vector machines to predict protein structural classes , 2007, Amino Acids.

[116]  B. Schwikowski,et al.  A network of protein–protein interactions in yeast , 2000, Nature Biotechnology.

[117]  G. Li,et al.  Classifying G protein-coupled receptors and nuclear receptors on the basis of protein power spectrum from fast Fourier transform , 2006, Amino Acids.

[118]  Z. Huang,et al.  Using cellular automata images and pseudo amino acid composition to predict protein subcellular location , 2005, Amino Acids.

[119]  Xin Li,et al.  Protein classification with imbalanced data , 2007, Proteins.

[120]  Kuo-Chen Chou,et al.  Using pseudo amino acid composition to predict protein structural classes: Approached with complexity measure factor , 2006, J. Comput. Chem..

[121]  K. Chou,et al.  Digital coding of amino acids based on hydrophobic index. , 2007, Protein and peptide letters.

[122]  Anton J. Enright,et al.  An efficient algorithm for large-scale detection of protein families. , 2002, Nucleic acids research.

[123]  Xiangjun Liu,et al.  GNBSL: A new integrative system to predict the subcellular location for Gram‐negative bacteria proteins , 2006, Proteomics.

[124]  K. Chou,et al.  Prediction of protein structural classes. , 1995, Critical reviews in biochemistry and molecular biology.

[125]  W. Wong,et al.  Transitive functional annotation by shortest-path analysis of gene expression data , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[126]  W. Wong,et al.  Functional annotation and network reconstruction through cross-platform integration of microarray data , 2005, Nature Biotechnology.

[127]  K. Chou,et al.  Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization. , 2006, Biochemical and biophysical research communications.

[128]  K. Chou,et al.  Hum-mPLoc: an ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. , 2007, Biochemical and biophysical research communications.

[129]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[130]  K. Chou,et al.  Signal-3L: A 3-layer approach for predicting signal peptides. , 2007, Biochemical and biophysical research communications.

[131]  Jonas S. Almeida,et al.  Alignment-free sequence comparison-a review , 2003, Bioinform..

[132]  S.-W. Zhang,et al.  Prediction of protein homo-oligomer types by pseudo amino acid composition: Approached with an improved feature extraction and Naive Bayes Feature Fusion , 2006, Amino Acids.

[133]  See-Kiong Ng,et al.  On combining multiple microarray studies for improved functional classification by whole-dataset feature selection. , 2003, Genome informatics. International Conference on Genome Informatics.

[134]  Parviz Abdolmaleki,et al.  Novel two-stage hybrid neural discriminant model for predicting proteins structural classes. , 2007, Biophysical chemistry.

[135]  Kuo-Chen Chou,et al.  Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network. , 2007, Protein and peptide letters.

[136]  Yi Pan,et al.  The community structure of human cellular signaling network , 2007, Journal of Theoretical Biology.

[137]  Kui Zhang,et al.  Prediction of protein function using protein-protein interaction data , 2002, Proceedings. IEEE Computer Society Bioinformatics Conference.

[138]  Z. Wen,et al.  Using pseudo amino acid composition to predict transmembrane regions in protein: cellular automata and Lempel-Ziv complexity , 2007, Amino Acids.

[139]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[140]  Luonan Chen,et al.  Revealing divergent evolution, identifying circular permutations and detecting active-sites by protein structure comparison , 2006, BMC Structural Biology.

[141]  Dong Xu,et al.  Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae. , 2004, Nucleic acids research.

[142]  Zheng-Zhi Wang,et al.  Classification of G-protein coupled receptors at four levels. , 2006, Protein engineering, design & selection : PEDS.

[143]  Shao-Wu Zhang,et al.  Using the concept of Chou’s pseudo amino acid composition to predict protein subcellular localization: an approach by incorporating evolutionary information and von Neumann entropies , 2008, Amino Acids.

[144]  Scott Dick,et al.  Classifier ensembles for protein structural class prediction with varying homology. , 2006, Biochemical and biophysical research communications.

[145]  Kuo-Chen Chou,et al.  Using stacked generalization to predict membrane protein types based on pseudo-amino acid composition. , 2006, Journal of theoretical biology.

[146]  Kuo-Chen Chou,et al.  Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides. , 2007, Biochemical and biophysical research communications.

[147]  S. Fields,et al.  The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. , 1991, Proceedings of the National Academy of Sciences of the United States of America.

[148]  Kuo-Chen Chou,et al.  Nuc-PLoc: a new web-server for predicting protein subnuclear localization by fusing PseAA composition and PsePSSM. , 2007, Protein engineering, design & selection : PEDS.

[149]  Yongsheng Ding,et al.  Prediction of protein subcellular location using hydrophobic patterns of amino acid sequence , 2006, Comput. Biol. Chem..

[150]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[151]  A. Owen,et al.  A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae) , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[152]  Kuo-Chen Chou,et al.  Fuzzy KNN for predicting membrane protein types from pseudo-amino acid composition. , 2006, Journal of theoretical biology.

[153]  Kuo-Chen Chou,et al.  Correlations of amino acids in proteins , 2003, Peptides.

[154]  K.-C. Chou,et al.  Using string kernel to predict signal peptide cleavage site based on subsite coupling model , 2005, Amino Acids.

[155]  Kuo-Chen Chou,et al.  Predicting membrane protein type by functional domain composition and pseudo-amino acid composition. , 2006, Journal of theoretical biology.

[156]  R. Sharan,et al.  Network-based prediction of protein function , 2007, Molecular systems biology.

[157]  Jishou Ruan,et al.  Novel scales based on hydrophobicity indices for secondary protein structure. , 2007, Journal of theoretical biology.

[158]  Xing-Ming Zhao,et al.  A novel approach to extracting features from motif content and protein composition for protein sequence classification , 2005, Neural Networks.

[159]  T. Takagi,et al.  Assessment of prediction accuracy of protein function from protein–protein interaction data , 2001, Yeast.

[160]  Eytan Domany,et al.  Superparamagnetic Clustering of Data , 1996 .