Protein Remote Homology Detection and Fold Recognition based on Features Extracted from Frequency Profiles

Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVMs) are the most effective and accurate methods for solving these problems. The performance of an SVM depends on how the proteins are vectorized, so a suitable representation of the protein sequence is a key step for SVM-based methods. In this paper, two kinds of profile-level building blocks of proteins, binary profiles and N-nary profiles, are presented; both capture the evolutionary information contained in the protein sequence frequency profile. The frequency profiles, calculated from the multiple sequence alignments output by PSI-BLAST, are converted into binary profiles or N-nary profiles. Each protein sequence is then transformed into a fixed-dimension feature vector by counting the occurrences of each binary profile or N-nary profile, and the resulting vectors are input to support vector machines. The latent semantic analysis (LSA) model, an efficient feature extraction algorithm, is adopted to further improve the performance of the methods. Experiments on protein remote homology detection and fold recognition show that the methods based on profile-level building blocks give better results than related methods.
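The vectorization step described above can be illustrated with a minimal sketch. It assumes a frequency profile is given as a list of columns, each holding 20 amino acid frequencies, and that a binary profile is obtained by thresholding each frequency. The threshold value, the alphabet ordering, and all function names here are hypothetical choices made for illustration; the abstract does not specify them, and this is not presented as the authors' implementation.

```python
from collections import Counter

# Assumed ordering of the 20 standard amino acids (illustrative only).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def to_binary_profiles(freq_profile, threshold=0.2):
    """Turn each profile column into a 'word' naming the amino acids
    whose frequency exceeds the (hypothetical) threshold."""
    words = []
    for column in freq_profile:
        word = "".join(aa for aa, f in zip(AMINO_ACIDS, column) if f > threshold)
        words.append(word)
    return words


def build_vocabulary(word_lists):
    """Map every distinct binary profile seen in the data to a vector index."""
    distinct = sorted({w for words in word_lists for w in words})
    return {w: i for i, w in enumerate(distinct)}


def vectorize(words, vocabulary):
    """Fixed-dimension vector of occurrence counts, one entry per vocabulary word."""
    counts = Counter(words)
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


def column(**freqs):
    """Helper for toy data: build a 20-entry column from keyword frequencies."""
    return [freqs.get(aa, 0.0) for aa in AMINO_ACIDS]


# Two toy profiles of different lengths (made-up numbers, not real PSI-BLAST output).
profile_a = [column(A=0.6, G=0.4), column(L=1.0), column(A=0.5, G=0.5)]
profile_b = [column(L=0.9, V=0.1), column(K=1.0)]

words_a = to_binary_profiles(profile_a)
words_b = to_binary_profiles(profile_b)
vocab = build_vocabulary([words_a, words_b])
vec_a = vectorize(words_a, vocab)
vec_b = vectorize(words_b, vocab)
```

Both toy proteins end up with vectors of the same length even though their sequences differ in length, which is what allows them to be passed to an SVM. An LSA step would operate on the matrix formed by stacking such vectors, for example through a truncated singular value decomposition.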
