Paper information - A discriminative method for remote homology detection based on n-peptide compositions with reduced amino acid alphabets

In this study, n-peptide compositions are used to vectorize proteins within a discriminative remote homology detection framework based on support vector machines (SVMs). The size of the amino acid alphabet is gradually reduced as n increases so that the method fits within the memory resources of conventional workstations. A hash structure is implemented to accelerate the search for n-peptides. The method is tested on its ability to classify proteins into families on a subset of the SCOP family database and is compared against many existing homology detection methods, including the most popular generative methods (SAM-98 and PSI-BLAST) and recent SVM methods (SVM-Fisher, SVM-BLAST and SVM-Pairwise). The results demonstrate that the new method significantly outperforms SVM-Fisher, SVM-BLAST, SAM-98 and PSI-BLAST, while achieving accuracy comparable to SVM-Pairwise. In terms of efficiency, it performs much better than SVM-Pairwise. The results show that n-peptide compositions with reduced amino acid alphabets provide an accurate and efficient means of protein vectorization for SVM-based sequence classification.
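The vectorization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the four-group alphabet `GROUPS`, the function names, and the normalization to relative frequencies are all assumptions made for illustration, since the abstract does not specify the actual reduced alphabets or the feature scaling. A Python `Counter` (a hash table) stands in for the hash structure mentioned above.

```python
from collections import Counter
from itertools import product

# Hypothetical 4-letter grouping, chosen for illustration only;
# the reduced alphabets used in the paper are not given in the abstract.
GROUPS = {
    "a": "AVLIMC",  # aliphatic / sulfur-containing
    "b": "FWYH",    # aromatic
    "c": "STNQGP",  # polar / small
    "d": "DEKR",    # charged
}
REDUCE = {aa: g for g, members in GROUPS.items() for aa in members}


def npeptide_counts(seq, n, reduce_map=None):
    """Count overlapping n-peptides, optionally over a reduced alphabet."""
    if reduce_map is None:
        symbols = list(seq.upper())
    else:
        # Residues outside the mapping (e.g. 'X') become None.
        symbols = [reduce_map.get(aa) for aa in seq.upper()]
    counts = Counter()  # hash-based lookup of each n-peptide
    for i in range(len(symbols) - n + 1):
        window = symbols[i:i + n]
        if None in window:
            continue  # skip windows that contain an unmapped residue
        counts["".join(window)] += 1
    return counts


def composition_vector(seq, n, alphabet, reduce_map=None):
    """Fixed-length vector of n-peptide relative frequencies."""
    counts = npeptide_counts(seq, n, reduce_map)
    total = sum(counts.values())
    keys = ["".join(p) for p in product(sorted(alphabet), repeat=n)]
    return [counts[k] / total if total else 0.0 for k in keys]


if __name__ == "__main__":
    vec = composition_vector("AVDEKRFW", 3, GROUPS.keys(), REDUCE)
    # 4**3 = 64 dimensions, versus 20**3 = 8000 for the full alphabet.
    print(len(vec))
```

The vector length is the alphabet size raised to the power n, which is why shrinking the alphabet as n grows keeps memory use bounded. The resulting vectors could then be passed to any SVM implementation.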

Hasan Oğul | Ü. Erkan Mumcuoglu
