Statistical alignment: computational properties, homology testing and goodness-of-fit.

The model of insertions and deletions in biological sequences, first formulated by Thorne, Kishino, and Felsenstein in 1991 (the TKF91 model), provides a basis for performing alignment within a statistical framework. Here we investigate this model.Firstly, we show how to accelerate the statistical alignment algorithms several orders of magnitude. The main innovations are to confine likelihood calculations to a band close to the similarity based alignment, to get good initial guesses of the evolutionary parameters and to apply an efficient numerical optimisation algorithm for finding the maximum likelihood estimate. In addition, the recursions originally presented by Thorne, Kishino and Felsenstein can be simplified. Two proteins, about 1500 amino acids long, can be analysed with this method in less than five seconds on a fast desktop computer, which makes this method practical for actual data analysis.Secondly, we propose a new homology test based on this model, where homology means that an ancestor to a sequence pair can be found finitely far back in time. This test has statistical advantages relative to the traditional shuffle test for proteins.Finally, we describe a goodness-of-fit test, that allows testing the proposed insertion-deletion (indel) process inherent to this model and find that real sequences (here globins) probably experience indels longer than one, contrary to what is assumed by the model.

[1]  M. O. Dayhoff,et al.  Atlas of protein sequence and structure , 1965 .

[2]  D. Sankoff Minimal Mutation Trees of Sequences , 1975 .

[3]  M. O. Dayhoff A model of evolutionary change in protein , 1978 .

[4]  O. Gotoh An improved algorithm for matching biological sequences. , 1982, Journal of molecular biology.

[5]  R. Doolittle Of urfs and orfs : a primer on how to analyze devised amino acid sequences , 1986 .

[6]  M. Bishop,et al.  Maximum likelihood alignment of DNA sequences. , 1986, Journal of molecular biology.

[7]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[8]  C. Chothia One thousand families for the molecular biologist , 1992, Nature.

[9]  William H. Press,et al.  Numerical recipes in C (2nd ed.): the art of scientific computing , 1992 .

[10]  Philipp Bucher,et al.  A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System , 1996, ISMB.

[11]  David C. Jones,et al.  Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. , 1996, Journal of molecular biology.

[12]  Jun Zhu,et al.  Bayesian adaptive sequence alignment algorithms , 1998, Bioinform..

[13]  G. Mitchison A Probabilistic Treatment of Phylogeny and Sequence Alignment , 1999, Journal of Molecular Evolution.

[14]  N. Reid,et al.  Likelihood , 1993 .