ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization

Background

Gene Ontology (GO) annotation, which describes the function of genes and gene products across species, has recently been used to predict protein subcellular and subnuclear localization. Existing GO-based prediction methods for protein subcellular localization use the known accession numbers of query proteins to obtain their annotated GO terms. An accurate method for predicting the subcellular localization of novel proteins without known accession numbers, using only the input sequence, is worth developing.

Results

This study proposes an efficient sequence-based method (named ProLoc-GO) that mines informative GO terms for predicting protein subcellular localization. For each protein, BLAST is used to obtain a homolog with a known accession number, from which the GO annotation is retrieved. The full set of n annotated GO terms that appear in a large set of training proteins is then collected. A novel genetic algorithm based method (named GOmining) combined with a support vector machine (SVM) classifier is proposed to simultaneously identify a small number m of the n GO terms as input features to the SVM, where m << n. The m informative GO terms contain the essential GO terms annotating subcellular compartments, such as GO:0005634 (Nucleus), GO:0005737 (Cytoplasm) and GO:0005856 (Cytoskeleton). Two existing data sets with <25% sequence identity, SCL12 (human proteins with 12 locations) and SCL16 (eukaryotic proteins with 16 locations), are used to evaluate ProLoc-GO, which has been implemented using a single SVM classifier with m = 44 and m = 60 informative GO terms, respectively. ProLoc-GO using input sequences yields test accuracies of 88.1% and 83.3% for SCL12 and SCL16, respectively. These are significantly better than the SVM-based methods using amino acid composition (AAC) with amino acid pairs and AAC with dipeptide composition, which achieve test accuracies of < 35%. For comparison, ProLoc-GO using known accession numbers of query proteins yields test accuracies of 90.6% and 85.7% for SCL12 and SCL16, respectively, which are also better than those of Hum-PLoc (85.0%) and Euk-OET-PLoc (83.7%), which use ensemble classifiers with hybridization of GO terms and amphiphilic pseudo amino acid composition.

Conclusion

The growth of Gene Ontology in size and popularity has increased the effectiveness of GO-based features. GOmining can serve as a tool for selecting informative GO terms in solving sequence-based prediction problems. The prediction system using ProLoc-GO with input sequences of query proteins for protein subcellular localization has been implemented (see Availability).
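To make the feature representation concrete, the following is a minimal sketch of how proteins annotated with GO terms could be turned into binary vectors over the n candidate terms and then restricted to m selected terms. It is not the authors' implementation: the function names, the toy protein names and their annotations are hypothetical, and the list of "informative" terms is fixed by hand here, whereas in the method described above it would be chosen by GOmining.

```python
from typing import Dict, List, Set


def build_vocabulary(annotations: Dict[str, Set[str]]) -> List[str]:
    """Collect every GO term seen in the training set (the n candidate features)."""
    return sorted(set().union(*annotations.values()))


def encode(go_terms: Set[str], selected: List[str]) -> List[int]:
    """Binary vector: 1 if the protein is annotated with the term, else 0."""
    return [1 if term in go_terms else 0 for term in selected]


# Hypothetical toy annotations; in the described pipeline these would be
# the GO terms of the homolog returned by BLAST for each sequence.
training = {
    "protA": {"GO:0005634", "GO:0003677"},
    "protB": {"GO:0005737"},
    "protC": {"GO:0005856", "GO:0005737"},
}

vocabulary = build_vocabulary(training)  # all n terms
# Hand-picked stand-in for the m terms a feature selector would return.
informative = ["GO:0005634", "GO:0005737", "GO:0005856"]

# m-dimensional feature vectors that a classifier such as an SVM could consume.
features = {name: encode(terms, informative) for name, terms in training.items()}
```

In this toy case n = 4 and m = 3; each entry of `features` is the kind of compact binary input a single classifier would be trained on.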
