Prediction of Protein Subchloroplast Locations using Random Forests

The subchloroplast location of a protein is correlated with its function. In contrast to the large number of available protein sequences, comparatively little is known about their locations and functions. Experimental identification of protein locations and functions is costly and time-consuming. Accurate prediction of protein subchloroplast locations can therefore accelerate the study of protein function in the chloroplast. This study proposes a Random Forest based method, ChloroRF, to predict protein subchloroplast locations using interpretable physicochemical properties. In addition to achieving high prediction accuracy, ChloroRF is able to select important physicochemical properties. These important properties are also analyzed to provide insights into the underlying mechanism.

Keywords: Chloroplast, Physicochemical properties, Protein locations, Random Forests.
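The abstract does not say how a sequence is turned into physicochemical features. One common encoding, shown here only as a sketch under assumptions, averages a per-residue property scale over the whole sequence, giving one feature per property. The Kyte-Doolittle hydropathy scale and the helper names `mean_property` and `encode` are illustrative choices, not taken from the paper.

```python
# Kyte-Doolittle hydropathy scale: an assumed example property. The property
# set actually used by ChloroRF is not given in the abstract.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_property(sequence, scale=KD_HYDROPATHY):
    """Average a per-residue property over a protein sequence.

    Residues not covered by the scale (e.g. 'X') are skipped.
    """
    values = [scale[aa] for aa in sequence.upper() if aa in scale]
    if not values:
        raise ValueError("sequence contains no residues covered by the scale")
    return sum(values) / len(values)


def encode(sequence, scales):
    """Build a feature vector with one averaged value per property scale."""
    return [mean_property(sequence, scale) for scale in scales]


# Hypothetical usage: a one-dimensional feature vector for a short peptide.
features = encode("MKTAYIAK", [KD_HYDROPATHY])
```

Feature vectors of this kind could then be passed to any Random Forest implementation, whose per-feature importance scores indicate which properties contribute most to the prediction.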
