Prediction of protein subcellular localization by weighted gene ontology terms.

We develop a new weighting approach of gene ontology (GO) terms for predicting protein subcellular localization. The weights of individual GO terms, corresponding to their contribution to the prediction algorithm, are determined by the term-weighting methods used in text categorization. We evaluate several term-weighting methods, which are based on inverse document frequency, information gain, gain ratio, odds ratio, and chi-square and its variants. Additionally, we propose a new term-weighting method based on the logarithmic transformation of chi-square. The proposed term-weighting method performs better than other term-weighting methods, and also outperforms state-of-the-art subcellular prediction methods. Our proposed method achieves 98.1%, 99.3%, 98.1%, 98.1%, and 95.9% overall accuracies for the animal BaCelLo independent dataset (IDS), fungal BaCelLo IDS, animal Höglund IDS, fungal Höglund IDS, and PLOC dataset, respectively. Furthermore, the close correlation between high-weighted GO terms and subcellular localizations suggests that our proposed method appropriately weights GO terms according to their relevance to the localizations.

[1]  Paul Horton,et al.  Nucleic Acids Research Advance Access published May 21, 2007 WoLF PSORT: protein localization predictor , 2007 .

[2]  Hwee Tou Ng,et al.  Feature selection, perceptron learning, and a usability case study for text categorization , 1997, SIGIR '97.

[3]  Maria Simi,et al.  Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization , 2000, ECDL.

[4]  Yang Dai,et al.  Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction , 2006, BMC Bioinformatics.

[5]  S. Brunak,et al.  Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. , 2000, Journal of molecular biology.

[6]  Burkhard Rost,et al.  Inferring sub-cellular localization through automated lexical analysis , 2002, ISMB.

[7]  K. Chou,et al.  Prediction of protein subcellular locations by GO-FunD-PseAA predictor. , 2004, Biochemical and biophysical research communications.

[8]  Hagit Shatkay,et al.  SherLoc2: a high-accuracy hybrid method for predicting subcellular localization of proteins. , 2009, Journal of proteome research.

[9]  B. Rost,et al.  Mimicking cellular sorting improves prediction of subcellular localization. , 2005, Journal of molecular biology.

[10]  Oliver Kohlbacher,et al.  MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition , 2006, Bioinform..

[11]  S. Brunak,et al.  Locating proteins in the cell using TargetP, SignalP and related tools , 2007, Nature Protocols.

[12]  Piero Fariselli,et al.  BaCelLo: a balanced subcellular localization predictor , 2006, ISMB.

[13]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[14]  Rachael P. Huntley,et al.  The GOA database in 2009—an integrated Gene Ontology Annotation resource , 2008, Nucleic Acids Res..

[15]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[16]  Oliver Kohlbacher,et al.  MultiLoc2: integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction , 2009, BMC Bioinformatics.

[17]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[18]  Hagit Shatkay,et al.  Pacific Symposium on Biocomputing 13:604-615(2008) EPILOC: A (WORKING) TEXT-BASED SYSTEM FOR PREDICTING PROTEIN SUBCELLULAR LOCATION , 2022 .

[19]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[20]  Shiow-Fen Hwang,et al.  ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization , 2008, BMC Bioinformatics.

[21]  Robert D. Finn,et al.  InterPro: the integrative protein signature database , 2008, Nucleic Acids Res..

[22]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[23]  Minoru Kanehisa,et al.  Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs , 2003, Bioinform..

[24]  Ao Li,et al.  LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST , 2005, Nucleic Acids Res..

[25]  R. Casadio,et al.  The prediction of protein subcellular localization from sequence: a shortcut to functional genome annotation. , 2008, Briefings in functional genomics & proteomics.

[26]  Dunja Mladenic,et al.  Feature Subset Selection in Text-Learning , 1998, ECML.

[27]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[28]  Fabrizio Sebastiani,et al.  Supervised term weighting for automated text categorization , 2003, SAC '03.

[29]  Duane Szafron,et al.  Improving subcellular localization prediction using text classification and the gene ontology , 2008, Bioinform..

[30]  Hagit Shatkay,et al.  SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. , 2007, Bioinformatics.