Performance of an iterated T-HMM for homology detection

MOTIVATION: Much information about new protein sequences is derived from identifying homologous proteins, a task that becomes difficult when the evolutionary relationships are distant. Some modern methods achieve better results by building a model of a set of related sequences and then identifying new proteins that fit the model. A further advance was the development of iterative methods that refine the model as more homologs are discovered. These methods are generally limited by ad hoc sequence-weighting schemes, neglect of the underlying evolutionary relationships, and the representation of the set with a single one-size-fits-all model. The tree hidden Markov model (T-HMM) approach avoids these limitations. Our previous work described how a non-iterative version of the T-HMM method identified distant homologs with better performance than other non-iterated approaches, and why the method is particularly well suited to implementation as an iterative algorithm.

RESULTS: We describe an iterative version of the T-HMM algorithm and evaluate its performance in detecting distant homologs. It shows a significant improvement over other commonly used methods.

AVAILABILITY: The software (C++, Perl) is available from the corresponding author.
