Prediction of protein subcellular localization by incorporating multiobjective PSO-based feature subset selection into the general form of Chou’s PseAAC

In this article, the possible subcellular location of a protein is predicted using a multiobjective particle swarm optimization-based feature selection technique. In the general form of pseudo-amino acid composition, protein sequences are used to construct protein features. Here, the different amino acid compositions are used to construct the feature sets, so the data are represented as protein samples described by amino acid composition features. The proposed algorithm tries to maximize feature relevance and minimize feature redundancy simultaneously. After the proposed algorithm is executed on the multiclass dataset, a subset of features is selected. On this resultant feature subset, tenfold cross-validation is applied and the corresponding accuracy, F score, entropy, representation entropy and average correlation are calculated. The performance of the proposed method is compared with that of its single-objective versions, sequential forward search, sequential backward search, minimum redundancy maximum relevance with two schemes, CFS, CBFS, $\chi^2$, Fisher discriminant and a cluster-based technique.
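The two objectives described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`amino_acid_composition`, `objectives`, `dominates`) are hypothetical, and absolute Pearson correlation is used here only as a simple stand-in for whatever relevance and redundancy measures the paper actually employs. Class labels are assumed to be encoded as numbers, which is a crude simplification for a multiclass problem.

```python
from itertools import combinations
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the 20 relative residue frequencies of a protein sequence."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard residues are ignored
    return [c / total if total else 0.0 for c in counts]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def objectives(X, y, mask):
    """Return (relevance, redundancy) for the features selected by mask.

    Assumption: relevance is the mean |correlation| between each selected
    feature and the numeric class label; redundancy is the mean
    |correlation| over all pairs of selected features.
    """
    cols = [[row[j] for row in X] for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0, 0.0
    relevance = sum(abs(pearson(c, y)) for c in cols) / len(cols)
    pairs = list(combinations(cols, 2))
    redundancy = (
        sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    )
    return relevance, redundancy


def dominates(p, q):
    """True if p Pareto-dominates q: relevance is maximized, redundancy minimized."""
    return p[0] >= q[0] and p[1] <= q[1] and p != q
```

In a binary PSO, each particle would carry one such `mask` over the composition features, and `dominates` would decide which candidate subsets enter the non-dominated archive. The swarm update itself is omitted here because the abstract does not specify it.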
