Using Functional Domain Composition and Support Vector Machines for Prediction of Protein Subcellular Location

Proteins are generally classified into the following 12 subcellular locations: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. The function of a protein is closely correlated with its subcellular location, and new protein sequences are entering databanks at a rapidly increasing rate, so a high-throughput tool for predicting protein subcellular location is vitally important for both basic research and the pharmaceutical industry. In this paper, a new concept, the “functional domain composition,” is introduced. Based on this concept, a protein can be represented as a vector in a high-dimensional space, where each of the clustered functional domains derived from the protein universe serves as a basis vector. With this representation, the support vector machine (SVM) algorithm is applied to predict protein subcellular location. High success rates are obtained in the self-consistency test, the jackknife test, and the independent dataset test. The current approach not only can play an important complementary role to the powerful covariant discriminant algorithm based on the pseudo amino acid composition representation (Chou, K. C. (2001) Proteins Struct. Funct. Genet. 43, 246–255; Correction (2001) Proteins Struct. Funct. Genet. 44, 60), but also may greatly stimulate the development of this area.
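The representation and the jackknife test described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: the domain list `DOMAINS` and the toy proteins are hypothetical, the encoding is assumed to be binary (1 if a domain is found in the protein, 0 otherwise), and a simple nearest-neighbor rule stands in for the SVM so that the example needs only the standard library. It is not the authors' implementation.

```python
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical domain vocabulary. In the paper the basis comes from clustered
# functional domains of the protein universe; these four names are made up.
DOMAINS: List[str] = ["kinase", "zinc_finger", "signal_peptide", "transit_peptide"]

Sample = Tuple[List[int], str]  # (domain-composition vector, location label)


def domain_composition(hits: Set[str], domains: Sequence[str] = DOMAINS) -> List[int]:
    """Encode a protein as a 0/1 vector: component i is 1 if domain i was found.

    The binary encoding is an assumption made for this sketch.
    """
    return [1 if d in hits else 0 for d in domains]


def nearest_neighbor(train: Sequence[Sample], query: List[int]) -> str:
    """Stand-in classifier (NOT an SVM): label of the closest training vector,
    measured by Hamming distance."""
    closest = min(train, key=lambda s: sum(a != b for a, b in zip(s[0], query)))
    return closest[1]


def jackknife_accuracy(
    samples: Sequence[Sample],
    classify: Callable[[Sequence[Sample], List[int]], str] = nearest_neighbor,
) -> float:
    """Jackknife (leave-one-out) test: each protein is predicted by a
    classifier that sees all the other proteins but not the one held out."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = list(samples[:i]) + list(samples[i + 1:])
        if classify(rest, vector) == label:
            correct += 1
    return correct / len(samples)


# Toy data, invented for illustration only.
TOY_PROTEINS: List[Sample] = [
    (domain_composition({"zinc_finger"}), "nucleus"),
    (domain_composition({"zinc_finger", "kinase"}), "nucleus"),
    (domain_composition({"signal_peptide"}), "extracellular"),
    (domain_composition({"signal_peptide", "kinase"}), "extracellular"),
]

if __name__ == "__main__":
    print(domain_composition({"kinase", "signal_peptide"}))
    print(jackknife_accuracy(TOY_PROTEINS))
```

Because `jackknife_accuracy` accepts any `classify` callable, an SVM trained on the held-in samples could be passed in place of `nearest_neighbor` without changing the test loop.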
