Hadamard conjugations and modeling sequence evolution with unequal rates across sites.

This paper considers the many different distributions that may approximate the distribution of site rates in DNA sequences and shows how the Hadamard conjugation may be modified to take these into account. This is done for both 2-state and 4-state data. Distributions that give simple closed forms include the gamma (Γ) distribution, the inverse Gaussian distribution (which is similar to the lognormal), and a mixture of either of these with a proportion of sites that cannot change (invariant sites). It is seen that the tail of a distribution can have major effects upon the coefficient of variation of site rates. Because the Hadamard conjugation can be used either to correct data or to predict the data given the model (i.e., the likelihood of site patterns), light is shed on properties of maximum likelihood tree selection with unequal site rates. Analysis of rRNA shows how unequal rates across sites can change the optimal tree. Maximum likelihood analysis also shows that distinct distributions fit each data set, with the gamma often not being the best. Analyzing both these data and a long stretch of primate mtDNA reveals evidence of many "hidden" multiple substitutions, while signals not corresponding to the preferred biological tree generally decrease as unequal rates are allowed for. Last, we discuss the expected behavior of sequences evolving by models where stabilizing selection alone explains unequal site rates. Such models do not explain "synapomorphies" or informative changes in ancient molecules, because while stabilizing selection can vastly decrease change at a site, it will also vastly accelerate back-substitution (leaving only a covarion model to explain old synapomorphies). When and why models allowing a continuous distribution of site rates (e.g., gamma) will approximate covarion evolution requires further study.
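As a rough illustration of the 2-state case with gamma-distributed site rates, the sketch below replaces the logarithm in the usual Hadamard conjugation with the inverse of the gamma moment generating function, and the exponential with the moment generating function itself. It assumes a mean-1 gamma distribution with shape parameter `shape`, a Sylvester-ordered Hadamard matrix, and a split-weight vector whose first entry is minus the sum of the others. All function and variable names are hypothetical, and the paper's own notation and indexing may differ.

```python
import numpy as np

def sylvester_hadamard(order):
    """Sylvester-type Hadamard matrix; `order` must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < order:
        h = np.block([[h, h], [h, -h]])
    return h

def gamma_mgf(rho, shape):
    """Moment generating function of a mean-1 gamma distribution (assumed rate model)."""
    return (1.0 - rho / shape) ** (-shape)

def gamma_mgf_inverse(r, shape):
    """Inverse of gamma_mgf; tends to log(r) as shape grows large."""
    return shape * (1.0 - r ** (-1.0 / shape))

def predict_patterns(edge_weights, shape):
    """Predict site-pattern frequencies from split weights (model -> data)."""
    h = sylvester_hadamard(len(edge_weights))
    return h @ gamma_mgf(h @ edge_weights, shape) / len(edge_weights)

def correct_patterns(patterns, shape):
    """Recover split weights from site-pattern frequencies (data -> model)."""
    h = sylvester_hadamard(len(patterns))
    return h @ gamma_mgf_inverse(h @ patterns, shape) / len(patterns)

# Hypothetical 3-taxon example: three pendant edges of length 0.1, 0.2, 0.3.
lengths = np.array([0.1, 0.2, 0.3])
edge_weights = np.concatenate(([-lengths.sum()], lengths))
patterns = predict_patterns(edge_weights, shape=0.5)
recovered = correct_patterns(patterns, shape=0.5)
```

Running the correction with the same shape value used for prediction returns the original split weights, while a very large shape value behaves like the equal-rates conjugation based on the logarithm.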
