What can and what cannot be inferred from pairwise sequence comparisons?

We address questions of identifiability in molecular phylogeny, the art of reconstructing the history of a sample of sequences given just the sequences at the leaves of the phylogenetic tree. Here, the 'history' consists of the tree topology, plus the transition probabilities which define the Markov process of sequence evolution along the branches of the tree. It is assumed that sequences have infinite length, so that the pairwise joint distributions of letters at the leaves can be taken to be known. We focus on two cases: (1) If the sites of a sequence evolve identically and independently, the topology can be reconstructed, but the one-way edge transition matrices cannot. However, the return-trip transition matrices are reconstructible for every edge, up to conjugation in the case of internal edges. (2) If a rate factor varies from site to site, different topologies may produce identical pairwise joint distributions, even under the same distribution of rate factors. Consequently, the topology is no longer identifiable from pairwise sequence comparisons, even if the distribution of rate factors is known. The results are discussed in the context of additive measures of phylogenetic distance.
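As an illustration of case (1) only, the following minimal sketch shows how a topology could be read off from pairwise joint distributions by means of an additive distance. It is not the construction used in the paper: it assumes the paralinear (LogDet-type) distance as one example of an additive measure, a four-leaf tree with true split ab|cd, four states, and randomly generated edge matrices. All names (`random_edge`, `joint`, `paralinear`, `pairing_sums`) and all numerical values are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def det(A):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]
    n, d = len(A), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[p][c] == 0.0:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d

def random_edge(rng, n=4, max_off=0.1):
    """Hypothetical edge matrix: row-stochastic, diagonally dominant,
    not assumed reversible or stationary."""
    M = []
    for i in range(n):
        row = [rng.uniform(0.0, max_off) for _ in range(n)]
        row[i] = 0.0
        row[i] = 1.0 - sum(row)
        M.append(row)
    return M

def joint(dist, P_left, P_right):
    """Joint distribution of two nodes that are conditionally independent
    given a common node with distribution `dist`:
    F[i][j] = sum_k dist[k] * P_left[k][i] * P_right[k][j]."""
    n = len(dist)
    return [[sum(dist[k] * P_left[k][i] * P_right[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def paralinear(F):
    """Paralinear distance computed from a pairwise joint distribution F."""
    fx = [sum(row) for row in F]        # marginal of the first leaf
    fy = [sum(col) for col in zip(*F)]  # marginal of the second leaf
    return -math.log(abs(det(F)) / math.sqrt(math.prod(fx) * math.prod(fy)))

def pairing_sums(F):
    """Distance sums for the three possible splits of four leaves."""
    d = {pair: paralinear(M) for pair, M in F.items()}
    return {"ab|cd": d["ab"] + d["cd"],
            "ac|bd": d["ac"] + d["bd"],
            "ad|bc": d["ad"] + d["bc"]}

# Assumed tree: root u with leaves a, b; internal edge u -> v; v has leaves c, d.
rng = random.Random(0)
pi_u = [0.1, 0.2, 0.3, 0.4]  # assumed root distribution
Ma, Mb, Mi, Mc, Md = (random_edge(rng) for _ in range(5))
pi_v = [sum(pi_u[k] * Mi[k][j] for k in range(4)) for j in range(4)]
Mic, Mid = matmul(Mi, Mc), matmul(Mi, Md)  # transitions u -> c and u -> d

F = {"ab": joint(pi_u, Ma, Mb), "ac": joint(pi_u, Ma, Mic),
     "ad": joint(pi_u, Ma, Mid), "bc": joint(pi_u, Mb, Mic),
     "bd": joint(pi_u, Mb, Mid), "cd": joint(pi_v, Mc, Md)}

sums = pairing_sums(F)
best = min(sums, key=sums.get)  # four-point condition: smallest sum marks the split
print(best, sums)
```

Under these assumptions the two larger sums coincide up to rounding and the smallest one identifies the split, in line with the four-point condition for additive distances. The sketch relies on determinants being multiplicative along a path; in case (2) each pairwise joint distribution is a mixture over rate factors, so this argument no longer applies as written.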
