BS-KNN: An Effective Algorithm for Predicting Protein Subchloroplast Localization

Chloroplasts are the organelles that conduct photosynthesis in the cells of green plants and eukaryotic algae. Knowing a protein's subchloroplast location gives insight into its function and into the microenvironment in which it interacts with other molecules. In this paper, we present BS-KNN, a bit-score weighted K-nearest neighbor method for predicting the subchloroplast locations of proteins. The method bases its predictions on a bit-score weighted Euclidean distance computed from selected pseudo-amino acid composition features. In cross-validation, it achieved 76.4% overall accuracy in assigning proteins to four subchloroplast locations. On an independent test set that the method did not see during training or feature selection, it achieved a consistent overall accuracy of 76.0%. The method was also applied to predict the subchloroplast locations of proteins in the chloroplast proteome, and the predictions were validated against proteins in Arabidopsis thaliana. The software and datasets for the proposed method are available at https://edisk.fandm.edu/jing.hu/bsknn/bsknn.html.
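The abstract does not give the exact distance formula or voting rule, so the following is only a minimal sketch of what a bit-score weighted nearest-neighbor classifier could look like. It assumes that the Euclidean distance between two feature vectors is divided by the alignment bit score between the two proteins, and that the K nearest neighbors vote by simple majority. The function names, the location labels `"A"` and `"B"`, the feature values, and the bit scores are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def weighted_distance(x, y, bit_score):
    """Euclidean distance between x and y, divided by the bit score.

    ASSUMPTION: dividing by the bit score is one plausible weighting;
    the paper's actual formula may differ. Scores below 1 are clamped
    to 1 to avoid division by zero.
    """
    euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return euclid / max(bit_score, 1.0)


def predict(query, training, bit_scores, k=3):
    """Predict a label for `query` by majority vote of its k nearest neighbors.

    training   -- list of (feature_vector, label) pairs
    bit_scores -- bit score of the query against each training protein,
                  in the same order as `training` (hypothetical input)
    """
    ranked = sorted(
        zip(training, bit_scores),
        key=lambda item: weighted_distance(query, item[0][0], item[1]),
    )
    votes = Counter(label for (_, label), _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: two training proteins with made-up features and labels.
toy_training = [([0.4, 0.5], "A"), ([0.9, 0.5], "B")]
toy_query = [0.5, 0.5]

# With equal bit scores the geometrically closer protein wins.
unweighted = predict(toy_query, toy_training, [1.0, 1.0], k=1)
# A high bit score for the second protein pulls it closer to the query.
weighted = predict(toy_query, toy_training, [1.0, 100.0], k=1)
```

In this toy case the query is geometrically closer to the first protein, but a much higher bit score for the second protein shrinks its weighted distance and changes the prediction. This illustrates how sequence similarity can override plain feature-space proximity under the assumed weighting.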
