Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection

Protein remote homology detection is one of the most important problems in bioinformatics. Discriminative methods such as support vector machines (SVMs) have shown superior performance, but that performance depends on how the protein sequences are represented as vectors. Prior work has demonstrated that sequence-order effects are relevant for discrimination, yet little work has explored how to incorporate sequence-order information together with amino acid physicochemical properties into the prediction. To incorporate sequence-order effects into protein remote homology detection, we propose the physicochemical distance transformation (PDT) method. Each protein sequence is first converted into a series of numbers using the physicochemical property scores in the amino acid index (AAindex) and is then converted into a fixed-length vector by PDT. This approach includes sequence-order information in the feature vector at little computational cost. Finally, the feature vectors are input into a support vector machine classifier to detect remote homologs. Our experiments on a well-known benchmark show that the proposed method, SVM-PDT, achieves performance superior or comparable to current state-of-the-art methods at a considerably lower computational cost. When evolutionary information extracted from frequency profiles is combined with PDT, the profile-based PDT approach improves performance by 3.4% and 11.4% in terms of ROC score and ROC50 score, respectively. PDT efficiently captures the local sequence-order information of a protein while incorporating the physicochemical properties extracted from the amino acid index into the prediction. The physicochemical distance transformation provides a general framework that could be a valuable tool for protein-level studies.
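The two-step conversion described above (residues to property scores, then scores to a fixed-length vector) can be sketched as follows. This is a minimal illustration and makes no claim to be the authors' implementation. The abstract does not give the PDT formula, so the sketch assumes that each feature is the mean squared difference between the property values of residues d positions apart. The Kyte-Doolittle hydropathy scale stands in for the AAindex property scores, and the names `encode`, `pdt_features`, and `max_distance` are hypothetical.

```python
# Illustrative property scale: Kyte-Doolittle hydropathy values.
# Assumption: a real pipeline would load many scales from AAindex instead.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale):
    """Convert a protein sequence into a series of property scores."""
    return [scale[residue] for residue in sequence]


def pdt_features(sequence, scales, max_distance):
    """Return a vector of length len(scales) * max_distance.

    Assumed form of each feature: for one scale and one distance d, the mean
    of (value[i] - value[i + d]) ** 2 over every valid position i.
    """
    features = []
    for scale in scales:
        values = encode(sequence, scale)
        length = len(values)
        if length <= max_distance:
            raise ValueError("sequence must be longer than max_distance")
        for d in range(1, max_distance + 1):
            total = sum(
                (values[i] - values[i + d]) ** 2 for i in range(length - d)
            )
            features.append(total / (length - d))
    return features


if __name__ == "__main__":
    # Sequences of different lengths map to vectors of the same length.
    short_vec = pdt_features("MKTAYIAK", [HYDROPATHY], max_distance=3)
    long_vec = pdt_features("MKTAYIAKQRQISFVKSHFSRQ", [HYDROPATHY], max_distance=3)
    print(len(short_vec), len(long_vec))
```

Under these assumptions the vector length depends only on the number of scales and on `max_distance`, not on the sequence length, so proteins of any length can be passed to a standard SVM implementation. Each feature needs a single pass over the sequence, which is consistent with the low computational cost claimed above.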
