Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA

An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. Existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and the position-specific scoring matrix (PSSM), each describe a protein sequence from a single perspective and are therefore insufficient on their own. This paper proposes two fusion representations, DipPSSM and PseAAPSSM, which integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce balance factors that weight the importance of its components. The optimal values of the balance factors are sought by a genetic algorithm. Because the proposed representations are high-dimensional, linear discriminant analysis (LDA) is used to find their important low-dimensional structure, which is essential for classification and location prediction. Numerical experiments on two public datasets with a KNN classifier and cross-validation tests showed that, in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusion representations outperform the traditional representations in protein sub-nuclear localization, and that the representations reduced by LDA outperform the unreduced ones.
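As a rough illustration of the ideas summarized above, the following Python sketch computes a dipeptide composition vector, fuses two feature vectors with a single balance factor, and evaluates the four indexes named in the abstract from a binary confusion matrix. The function names, the weighted-concatenation form of the fusion, and the one-vs-rest treatment of the metrics are assumptions made for illustration; the abstract does not specify the exact fusion formula. The genetic algorithm search, LDA, and KNN steps are not shown.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the 400-dimensional frequency vector of adjacent residue pairs."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        pair = a + b
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def fuse(first, second, balance):
    """Weighted concatenation of two feature vectors (assumed fusion form).

    `balance` in [0, 1] weights `first`; (1 - balance) weights `second`.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in [0, 1]")
    return [balance * x for x in first] + [(1.0 - balance) * x for x in second]


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC for one class vs. the rest."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    dipc = dipeptide_composition("MKVLAAGIVKL")
    # Hypothetical placeholder standing in for a PSSM-derived feature vector.
    pssm_features = [0.1] * 20
    fused = fuse(dipc, pssm_features, balance=0.4)
    print(len(fused))
    print(binary_metrics(tp=8, tn=9, fp=1, fn=2))
```

In a setup like this, the balance factor would be the quantity a genetic algorithm tunes, for example by maximizing cross-validated accuracy of the downstream classifier on the fused vectors.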
