A Novel Ensemble Technique for Protein Subcellular Location Prediction

In this chapter we present an ensemble classifier that performs multi-class classification by combining several kernel classifiers through a Decision Directed Acyclic Graph (DDAG). Each base classifier, called K-TIPCAC, is mainly based on projecting the given points onto the Fisher subspace, which is estimated on the training data by means of a novel technique. The proposed multi-class classifier is applied to protein subcellular location prediction, one of the most difficult multi-class prediction problems in modern computational biology. Although many methods have been proposed in the literature to solve this problem, all existing approaches are affected by some limitations, so the problem remains open. Experimental results indicate that the proposed technique, called DDAG K-TIPCAC, performs as well as, if not better than, state-of-the-art ensemble methods aimed at multi-class classification of highly unbalanced data.
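The way a DDAG combines binary classifiers into a multi-class decision can be illustrated with a minimal sketch. It follows the standard DDAG evaluation rule: keep a list of candidate classes, test the first against the last, and discard the loser until one class remains. The pairwise classifiers here are toy nearest-centroid rules standing in for K-TIPCAC, which is not reproduced; the names `ddag_predict`, `nearest_centroid_pair`, and `build_toy_ddag`, as well as the centroid values, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A pairwise classifier returns True if it prefers the first class of its
# pair for the given point, and False if it prefers the second.
PairwiseClassifier = Callable[[Sequence[float]], bool]


def ddag_predict(
    x: Sequence[float],
    classes: Sequence[str],
    pairwise: Dict[Tuple[str, str], PairwiseClassifier],
) -> str:
    """Evaluate a DDAG: one class is eliminated per node until one is left."""
    candidates: List[str] = list(classes)
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        if pairwise[(first, last)](x):
            candidates.pop()    # classifier prefers `first`: drop `last`
        else:
            candidates.pop(0)   # classifier prefers `last`: drop `first`
    return candidates[0]


def nearest_centroid_pair(
    c_a: Sequence[float], c_b: Sequence[float]
) -> PairwiseClassifier:
    """Toy stand-in for a trained binary classifier (NOT K-TIPCAC)."""

    def prefers_first(x: Sequence[float]) -> bool:
        d_a = sum((xi - ci) ** 2 for xi, ci in zip(x, c_a))
        d_b = sum((xi - ci) ** 2 for xi, ci in zip(x, c_b))
        return d_a <= d_b

    return prefers_first


def build_toy_ddag(centroids: Dict[str, Sequence[float]]):
    """Build one pairwise classifier for every ordered pair (a, b), a < b."""
    classes = sorted(centroids)
    pairwise = {
        (a, b): nearest_centroid_pair(centroids[a], centroids[b])
        for i, a in enumerate(classes)
        for b in classes[i + 1:]
    }
    return classes, pairwise


# Hypothetical 2-D class centroids, used only to exercise the sketch.
toy_centroids = {
    "cytoplasm": (0.0, 0.0),
    "membrane": (0.0, 5.0),
    "nucleus": (5.0, 5.0),
}
toy_classes, toy_pairwise = build_toy_ddag(toy_centroids)
prediction = ddag_predict((4.5, 5.2), toy_classes, toy_pairwise)
```

With K classes, a prediction requires K - 1 binary evaluations, although K(K - 1)/2 pairwise classifiers are trained.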
