The distance-profile representation and its application to detection of distantly related protein families

Background: Detecting homology between remotely related protein families is an important problem in computational biology, since the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins. Many existing approaches address this problem by measuring the similarity between proteins through sequence or structural alignment. However, these methods do not exploit collective aspects of the protein space, and the scores they compute are often noisy and frequently fail to recognize distantly related protein families.

Results: We describe an algorithm that improves over the state of the art in homology detection by utilizing global information on the proximity of entities in the protein space. Our method relies on a vectorial representation of proteins and protein families. It uses structure-specific association measures between proteins and template structures to form a high-dimensional feature vector for each query protein. These vectors are then processed and transformed into sparse feature vectors that are treated as statistical fingerprints of the query proteins. The new representation induces a new metric between proteins, measured by the statistical difference between their corresponding probability distributions.

Conclusion: Using several performance measures, we show that the new tool considerably improves performance in recognizing distant homologies compared to existing approaches such as PSI-BLAST and FUGUE.
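The representation outlined in the Results can be illustrated with a small sketch. The abstract does not specify how the feature vectors are processed or which statistical difference is used, so everything below is an assumption made for illustration: sparsification keeps the `keep` highest template scores and normalizes them to sum to one, and the distance is the Jensen-Shannon divergence (base 2). The function names `to_sparse_distribution` and `js_divergence` are hypothetical and do not come from the paper.

```python
import math


def to_sparse_distribution(scores, keep=3):
    """Turn raw protein-vs-template association scores into a sparse
    probability distribution (assumed scheme: keep the top `keep`
    scores, zero the rest, normalize to sum to 1)."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Indices of the `keep` highest-scoring templates.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    # Negative scores are clipped to zero so the result is a valid distribution.
    kept = {i: max(scores[i], 0.0) for i in top}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no positive association scores to normalize")
    return [kept.get(i, 0.0) / total for i in range(len(scores))]


def _kl(p, m):
    """Kullback-Leibler divergence KL(p || m) in bits; zero terms are skipped."""
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same
    templates (assumed here as the 'statistical difference')."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


# Hypothetical scores of two query proteins against four template structures.
profile_a = to_sparse_distribution([5.0, 1.0, 3.0, 0.5], keep=2)
profile_b = to_sparse_distribution([4.0, 0.2, 4.0, 0.1], keep=2)
distance = js_divergence(profile_a, profile_b)
```

With base-2 logarithms the divergence lies between 0 (identical profiles) and 1 (profiles with no templates in common), so two proteins that score highly against the same templates end up close even when their direct alignment score is weak.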
