Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier

We compared four classifiers on the problem of predicting the cellular localization sites of proteins in yeast and E. coli. A set of sequence-derived features, such as regions of high hydrophobicity, was used for each classifier. The methods compared were a structured probabilistic model designed specifically for the localization problem, the k nearest neighbors classifier, the binary decision tree classifier, and the naïve Bayes classifier. Tests using stratified cross-validation show that the k nearest neighbors classifier performs better than the other methods. For yeast, this difference was statistically significant under a cross-validated paired t-test. The k nearest neighbors classifier achieves an accuracy of approximately 60% for 10 yeast classes and 86% for 8 E. coli classes. The best previously reported accuracies for these datasets were 55% and 81%, respectively.
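The evaluation procedure described above (k nearest neighbors, stratified cross-validation, and a paired t statistic over per-fold accuracies) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the data are synthetic two-class points rather than the yeast or E. coli datasets, the choice k = 5 and the five folds are arbitrary assumptions, and a majority-class predictor stands in for the other classifiers. All function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def stratified_folds(y, n_folds, seed=0):
    """Partition indices into folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)
    return folds

def cv_accuracies(X, y, predict, n_folds=5, seed=0):
    """Per-fold accuracy of predict(train_X, train_y, x) on held-out folds."""
    accuracies = []
    for fold in stratified_folds(y, n_folds, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(predict(train_X, train_y, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return accuracies

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i] (n - 1 degrees of freedom)."""
    diffs = [u - v for u, v in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        # Identical differences in every fold: the statistic is degenerate.
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)

# Synthetic data (an assumption for illustration): two Gaussian blobs in 2D.
rng = random.Random(42)
X = [[rng.gauss(3 * c, 1), rng.gauss(3 * c, 1)] for c in (0, 1) for _ in range(50)]
y = [c for c in (0, 1) for _ in range(50)]

# Both methods are scored on the same folds, so the per-fold accuracies pair up.
knn_acc = cv_accuracies(X, y, lambda tX, ty, x: knn_predict(tX, ty, x, k=5))
base_acc = cv_accuracies(X, y, lambda tX, ty, x: Counter(ty).most_common(1)[0][0])

print("kNN mean accuracy:", sum(knn_acc) / len(knn_acc))
print("baseline mean accuracy:", sum(base_acc) / len(base_acc))
print("paired t statistic:", paired_t_statistic(knn_acc, base_acc))
```

Scoring both classifiers on identical folds is what makes the test paired: the statistic is computed from the fold-by-fold differences in accuracy rather than from two independent samples.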
