Sequence Features and Subset Selection Technique for the Prediction of Protein Trafficking Phenomenon in Eukaryotic Non Membrane Proteins

Protein trafficking, or protein sorting, is the mechanism by which a cell transports proteins to their appropriate locations inside or outside the cell. This targeting relies on information contained in the protein itself. Many methods predict the subcellular location of eukaryotic proteins from sequence information, but most of them treat the problem as a flat classification task. In this work, we introduce ensemble methods that predict locations in the eukaryotic non-membrane protein-sorting pathway hierarchically. All features were extracted exclusively from full-length protein sequences, and feature subset selection was applied before classification. Sequence-driven features, sequence-mapped features and sequence autocorrelation features were tested with ensemble learners, and classifier performance was compared with and without feature subset selection. The results indicate that the new features extracted from full-length eukaryotic protein sequences are effective at capturing biological differences among compartments in the eukaryotic non-membrane pathway at two levels. Feature subset selection also reduced the time taken to build the classification model.
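To make the notion of features computed from the full-length sequence more concrete, the following minimal sketch derives two illustrative feature families from a protein sequence: amino acid composition and a simple lag-based autocorrelation of residue hydropathy. It is not the feature set used in this work. The function names, the choice of the Kyte-Doolittle hydropathy scale, the averaged-product form of the autocorrelation, and the maximum lag are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index (assumed property; any per-residue scale could be used)
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in AMINO_ACIDS]


def autocorrelation(seq, lag):
    """Mean product of hydropathy values for residues `lag` positions apart.

    Assumes the sequence contains only the 20 standard residues.
    """
    values = [HYDROPATHY[a] for a in seq]
    n = len(values)
    if lag < 1 or lag >= n:
        raise ValueError("lag must satisfy 1 <= lag < len(seq)")
    pairs = n - lag
    return sum(values[i] * values[i + lag] for i in range(pairs)) / pairs


def sequence_features(seq, max_lag=3):
    """Concatenate composition (20 values) and autocorrelation (max_lag values)."""
    lags = range(1, max_lag + 1)
    return composition(seq) + [autocorrelation(seq, d) for d in lags]
```

A vector produced by the hypothetical `sequence_features` could then be passed to a feature subset selection step and an ensemble learner; with `max_lag=3` it has 23 entries per sequence.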
