A hidden Markov model for progressive multiple alignment

MOTIVATION Progressive algorithms are widely used heuristics for aligning multiple nucleic-acid or protein sequences. Probabilistic approaches that provide measures of the global and/or local reliability of individual solutions would be a valuable development.

RESULTS We present a new method for multiple sequence alignment that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process. The method iterates pairwise alignments according to a guide tree and defines each ancestral sequence from the pairwise alignment of its child nodes, thus progressively constructing a multiple alignment. It allows the minimum posterior probability of each column to be computed, and we show that this value correlates with the correctness of the result, hence providing an efficient means by which unreliably aligned columns can be filtered out of a multiple alignment.
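The column-filtering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 threshold, and the input layout are hypothetical, and it assumes that a column's minimum posterior probability is the smallest of the posterior probabilities assigned to that column by the pairwise alignments along the guide tree, which is one reading of the abstract.

```python
from typing import List, Sequence


def column_min_posterior(per_node: Sequence[Sequence[float]]) -> List[float]:
    """Return, for each column, the smallest posterior across all rows.

    Assumption: per_node[k][j] is the posterior probability that pairwise
    alignment k (one per internal guide-tree node) assigns to column j.
    """
    return [min(column) for column in zip(*per_node)]


def filter_columns(
    alignment: Sequence[str],
    min_posteriors: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[str]:
    """Keep only the columns whose minimum posterior reaches the threshold."""
    n_cols = len(min_posteriors)
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("every aligned sequence must have one character per column")
    keep = [j for j, p in enumerate(min_posteriors) if p >= threshold]
    return ["".join(row[j] for j in keep) for row in alignment]


# Toy data: three aligned sequences, two internal nodes, five columns.
alignment = ["AC-GT", "ACTGT", "A-TGA"]
per_node = [
    [0.99, 0.40, 0.90, 0.95, 0.80],
    [0.99, 0.70, 0.35, 0.97, 0.85],
]
min_post = column_min_posterior(per_node)
filtered = filter_columns(alignment, min_post, threshold=0.5)
```

With these toy values, the second and third columns fall below the cutoff and are dropped, while the remaining columns are kept in their original order.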
