A new encoding technique for peptide classification

Research on peptide classification has focused mainly on studying different encodings and applying various classification algorithms to improve prediction accuracy. A main drawback of the existing literature is the lack of an extensive comparison among the available encoding methods across a wide range of classification problems. This paper addresses the fundamental question of which peptide encoding promises the best results for machine learning classifiers. Two novel encoding methods based on physicochemical properties of the amino acids are proposed, and an extensive comparison with several standard encoding methods is performed on three classification problems (HIV-protease, recognition of T-cell epitopes, and prediction of peptides that bind human leukocyte antigens). The experimental results demonstrate the effectiveness of the new encodings and show that the frequently used orthonormal encoding is inferior to other methods.
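
For readers unfamiliar with the terminology, the following minimal sketch illustrates what a peptide encoding looks like in practice. It shows the standard orthonormal encoding (each residue becomes a 20-dimensional indicator vector) next to a simple property-based encoding. The property-based function is only an illustration of the general idea and is not the encoding proposed in this paper; the use of the Kyte-Doolittle hydropathy scale, the function names, and the alphabet ordering are all assumptions made for this example.

```python
# Illustrative sketch only: names, alphabet order, and the choice of the
# Kyte-Doolittle scale are assumptions, not taken from the paper.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Kyte-Doolittle hydropathy values, used here as one example property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def orthonormal_encode(peptide):
    """Map each residue to a 20-bit indicator vector and concatenate them."""
    features = []
    for aa in peptide.upper():
        bits = [0] * len(AMINO_ACIDS)
        bits[_INDEX[aa]] = 1  # exactly one position is set per residue
        features.extend(bits)
    return features


def property_encode(peptide, scale=KYTE_DOOLITTLE):
    """Map each residue to a single physicochemical value from `scale`."""
    return [scale[aa] for aa in peptide.upper()]


if __name__ == "__main__":
    octamer = "SQNYPIVQ"  # an arbitrary 8-residue example peptide
    print(len(orthonormal_encode(octamer)))  # 8 residues x 20 bits
    print(property_encode(octamer))
```

The contrast in dimensionality is the point of the sketch: the orthonormal encoding yields 20 features per residue and treats all amino acids as equally dissimilar, whereas a property-based encoding yields one feature per residue per property and places chemically similar residues close together.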
