Paper Information - A Two-Layer Learning Architecture for Multi-Class Protein Folds Classification

Classification of protein folds plays an important role in protein structure discovery, especially when traditional sequence alignment methods fail to yield convincing structural homologies. In this chapter, we develop a two-layer learning architecture, named TLLA, for multi-class protein fold classification. In the first layer, OET-KNN (Optimized Evidence-Theoretic K Nearest Neighbors) serves as the component classifier and finds the K most probable folds for the query protein. In the second layer, a support vector machine (SVM) builds the multi-class classifier on only the K folds produced by the first layer, rather than on all 27 folds. To combine multiple features, a voting-based ensemble strategy gives the final classification result. Our method achieves a standard percentage accuracy of ~63% on the independent test dataset, in which most proteins have <25% sequence identity with those in the training dataset. Experimental evaluation on a widely used benchmark dataset shows that our approach outperforms the competing methods, suggesting it may be a useful tool for protein fold classification.

DOI: 10.4018/978-1-4666-3604-0.ch041
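The two-layer flow described in the abstract (shortlist K candidate folds, classify among only those candidates, then vote across feature sets) can be illustrated with a minimal sketch. This is not the authors' implementation: a distance-weighted nearest-neighbour vote stands in for OET-KNN, a nearest-centroid rule stands in for the SVM, and the function names, feature-set names, and toy data are all hypothetical.

```python
import math
from collections import Counter


def shortlist_folds(train, query, n_neighbors, k_folds):
    """Layer 1 (stand-in for OET-KNN): return the k_folds most probable folds.

    Each of the n_neighbors nearest training samples votes for its fold,
    with closer neighbours carrying more weight.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))
    scores = Counter()
    for features, fold in nearest[:n_neighbors]:
        scores[fold] += 1.0 / (1.0 + math.dist(features, query))
    return [fold for fold, _ in scores.most_common(k_folds)]


def classify_among(train, query, candidates):
    """Layer 2 (stand-in for the SVM): decide among the shortlisted folds only.

    Uses a nearest-centroid rule restricted to the candidate folds.
    """
    best_fold, best_dist = None, math.inf
    for fold in candidates:
        members = [features for features, label in train if label == fold]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        d = math.dist(centroid, query)
        if d < best_dist:
            best_fold, best_dist = fold, d
    return best_fold


def predict(feature_sets, queries, n_neighbors=3, k_folds=2):
    """Run both layers once per feature set and combine by majority vote."""
    votes = Counter()
    for name, train in feature_sets.items():
        candidates = shortlist_folds(train, queries[name], n_neighbors, k_folds)
        votes[classify_among(train, queries[name], candidates)] += 1
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two feature sets describing the same proteins.
feature_sets = {
    "composition": [
        ((0.0, 0.0), "fold_a"), ((0.1, 0.2), "fold_a"),
        ((1.0, 1.0), "fold_b"), ((0.9, 1.1), "fold_b"),
        ((5.0, 5.0), "fold_c"),
    ],
    "polarity": [
        ((0.0,), "fold_a"), ((0.2,), "fold_a"),
        ((1.0,), "fold_b"), ((1.2,), "fold_b"),
        ((4.0,), "fold_c"),
    ],
}
queries = {"composition": (0.2, 0.1), "polarity": (0.1,)}

print(predict(feature_sets, queries))  # fold_a
```

Restricting the second layer to the shortlist is the point of the design: the final classifier only has to separate K folds instead of all 27, at the cost of being unable to recover if the true fold is missing from the shortlist.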

Xieping Gao | Ruofei Wang
