Multilabel Learning via Random Label Selection for Protein Subcellular Multilocations Prediction

Predicting protein subcellular localization is an important but challenging problem, particularly when a protein may simultaneously exist at, or move between, two or more subcellular location sites. Most existing protein subcellular localization methods handle only single-location proteins. In the past few years, only a few methods have been proposed to tackle proteins with multiple locations. However, they adopt a simple strategy, transforming each multilocation protein into multiple single-location proteins, which does not take correlations among different subcellular locations into account. In this paper, we propose a novel method, multilabel learning via random label selection (RALS), which extends the simple binary relevance (BR) method to learn from multilocation proteins effectively and efficiently. RALS does not explicitly find the correlations among labels; instead, it attempts to learn them implicitly from the data by augmenting the original feature space with randomly selected labels as additional input features. In a fivefold cross-validation test on a benchmark data set, the proposed method, which considers label correlations, clearly outperforms the baseline BR method, which does not, indicating that correlations among different subcellular locations exist and contribute to improved prediction performance. Experimental results on two benchmark data sets also show that the proposed method achieves significantly higher performance than some other state-of-the-art methods in predicting subcellular multilocations of proteins. The prediction web server is publicly available at http://levis.tongji.edu.cn:8080/bioinfo/MLPred-Euk/.
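
The idea of augmenting the feature space with randomly selected labels can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which base classifier is used or how the label features are obtained at prediction time, so the sketch assumes a toy nearest-centroid learner and a two-stage scheme in which plain BR predictions stand in for the unknown labels of a query protein. All function names (`fit_centroid`, `predict_centroid`, `fit_rals`, `predict_rals`) and the parameter `k` are hypothetical.

```python
import random

def fit_centroid(X, y):
    """Toy base learner (an assumption): store the mean vector of each class."""
    cents = {}
    for cls in (0, 1):
        rows = [x for x, t in zip(X, y) if t == cls]
        if rows:
            cents[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict_centroid(cents, x):
    """Return the class whose stored centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(cents, key=lambda cls: sq_dist(cents[cls]))

def fit_rals(X, Y, k, seed=0):
    """Train one plain BR model and one label-augmented model per label."""
    rng = random.Random(seed)
    n_labels = len(Y[0])
    # Stage 1: plain binary relevance, one independent classifier per label.
    br = [fit_centroid(X, [y[j] for y in Y]) for j in range(n_labels)]
    # Stage 2: for each label, append k randomly chosen *other* labels
    # to the original features and train on the augmented space.
    aug = []
    for j in range(n_labels):
        others = [l for l in range(n_labels) if l != j]
        chosen = rng.sample(others, min(k, len(others)))
        X_aug = [list(x) + [y[l] for l in chosen] for x, y in zip(X, Y)]
        aug.append((chosen, fit_centroid(X_aug, [y[j] for y in Y])))
    return br, aug

def predict_rals(model, x):
    """Predict all labels; stage-1 outputs fill in the label features (assumed)."""
    br, aug = model
    first = [predict_centroid(c, x) for c in br]
    return [predict_centroid(c, list(x) + [first[l] for l in chosen])
            for chosen, c in aug]

# Tiny usage example with two perfectly correlated locations.
X = [[0.0], [0.1], [1.0], [0.9]]
Y = [[0, 0], [0, 0], [1, 1], [1, 1]]
model = fit_rals(X, Y, k=1)
print(predict_rals(model, [0.05]), predict_rals(model, [0.95]))
```

Because the selected labels enter only as extra input columns, any correlation between locations is picked up by the base learner rather than modeled explicitly, which is the property the abstract attributes to RALS.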
