A New Balanced Ensemble Classifier for Predicting Fungi Protein Subcellular Localization Based on Protein Primary Structures

Protein subcellular localization provides insights into protein function. Various computational methods based on protein sequences have been developed for this problem in the literature, but most of them have limited prediction accuracy. A general computational method with high prediction accuracy is therefore needed. In this work, we present a novel balanced ensemble classifier for fungi protein subcellular localization prediction based only on protein sequences. We make three contributions to this field. First, we present a new algorithm to cope with the class imbalance problem that arises in protein subcellular localization prediction, which improves prediction accuracy significantly. Second, we employ feature selection techniques to identify the most informative features for each compartment, which reduces computational cost and improves prediction accuracy at the same time. Third, we present an ensemble classifier that combines the outputs of distinct classifiers to further improve prediction accuracy. Numerical results on a benchmark dataset demonstrate the efficiency and effectiveness of the proposed method.
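The abstract does not spell out the balancing algorithm or the base classifiers, so the following is only a minimal sketch of the general idea of a balanced ensemble, under stated assumptions: sequences are encoded by amino acid composition, each ensemble member is a nearest-centroid classifier trained on an equally sized random draw from each class, and members are combined by majority vote. All function names are hypothetical, and the feature selection step is omitted.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Return the 20-dimensional amino acid composition of a sequence."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_balanced_ensemble(positives, negatives, n_members=5, seed=0):
    """Train one nearest-centroid member per balanced resample.

    Assumption (not from the paper): balance is obtained by drawing the
    same number of samples from each class for every member.
    """
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    members = []
    for _ in range(n_members):
        pos = rng.sample(positives, k)
        neg = rng.sample(negatives, k)
        members.append((centroid(pos), centroid(neg)))
    return members

def predict(members, x):
    """Majority vote: True if most members place x nearer the positive centroid."""
    votes = sum(1 if sq_dist(x, p) < sq_dist(x, n) else -1 for p, n in members)
    return votes > 0

# Toy data (made up for illustration): 2 positives versus 5 negatives.
pos_seqs = ["KKRKRKKA", "RRKKRKAA"]
neg_seqs = ["LLVVIILA", "VVLLIIAL", "LIVLIVAA", "AVLLIVLI", "ILVALLVI"]
ensemble = train_balanced_ensemble(
    [composition(s) for s in pos_seqs],
    [composition(s) for s in neg_seqs],
)
is_positive = predict(ensemble, composition("KRKRKKRA"))
is_negative = not predict(ensemble, composition("LLIVVLIA"))
```

For a multi-compartment problem, one such ensemble per compartment (one-versus-rest) would be a natural extension, although whether the proposed method is organized this way is not stated in the abstract.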
