Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences

MOTIVATION: Most phylogenetic methods assume that the sequences of nucleotides or amino acids have evolved under stationary, reversible and homogeneous conditions. When the data violate these assumptions, the probability of errors in the phylogenetic estimates increases. Methods to examine aligned sequences for such violations are available, but they are rarely used, possibly because they are not widely known or because they are poorly understood.

RESULTS: We describe and compare the available tests for symmetry of k-dimensional contingency tables from homologous sequences, and develop two new tests to evaluate different aspects of the evolutionary processes. For any pair of sequences, we consider a partition of the test for symmetry into a test for marginal symmetry and a test for internal symmetry. The proposed tests can be used to identify appropriate models for estimating evolutionary relationships under a Markovian model. Simulations under evolutionary conditions of varying complexity illustrate the performance of the tests. Finally, the tests were applied to an alignment of small-subunit ribosomal RNA sequences from five species of bacteria to outline the evolutionary processes under which they evolved.

AVAILABILITY: Programs written in R to do the tests on nucleotides are available from http://www.maths.usyd.edu.au/u/johnr/testsym/
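
As an illustration of the pairwise partition described above, the following Python sketch computes a test for symmetry, a test for marginal symmetry and their difference from a pair of aligned sequences. It is not the authors' R code. It assumes the textbook forms of Bowker's statistic for symmetry and Stuart's statistic for marginal symmetry, and it assumes that the internal-symmetry statistic is the difference of the two, with the degrees of freedom subtracted in the same way. All function names are hypothetical, and each statistic would be referred to a chi-square distribution with the stated degrees of freedom.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"  # assumed alphabet; other characters are skipped


def divergence_matrix(seq1, seq2, alphabet=NUCLEOTIDES):
    """Count sites with state i in seq1 and state j in seq2."""
    index = {c: i for i, c in enumerate(alphabet)}
    k = len(alphabet)
    n = [[0] * k for _ in range(k)]
    for a, b in zip(seq1, seq2):
        if a in index and b in index:  # ignore gaps and ambiguity codes
            n[index[a]][index[b]] += 1
    return n


def bowker_symmetry(n):
    """Bowker's statistic: sum over i<j of (n_ij - n_ji)^2 / (n_ij + n_ji)."""
    stat, df = 0.0, 0
    for i, j in combinations(range(len(n)), 2):
        total = n[i][j] + n[j][i]
        if total > 0:  # empty pairs contribute neither to stat nor to df
            stat += (n[i][j] - n[j][i]) ** 2 / total
            df += 1
    return stat, df


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for r in reversed(range(m)):
        rest = sum(aug[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (aug[r][m] - rest) / aug[r][r]
    return x


def stuart_marginal_symmetry(n):
    """Stuart's statistic d' V^{-1} d on the first k-1 categories."""
    k = len(n)
    row = [sum(n[i]) for i in range(k)]
    col = [sum(n[i][j] for i in range(k)) for j in range(k)]
    d = [row[i] - col[i] for i in range(k - 1)]
    v = [[row[i] + col[i] - 2 * n[i][i] if i == j else -(n[i][j] + n[j][i])
          for j in range(k - 1)] for i in range(k - 1)]
    x = solve(v, d)
    return sum(di * xi for di, xi in zip(d, x)), k - 1


def internal_symmetry(n):
    """Assumed partition: symmetry minus marginal symmetry."""
    s_b, f_b = bowker_symmetry(n)
    s_s, f_s = stuart_marginal_symmetry(n)
    return s_b - s_s, f_b - f_s


if __name__ == "__main__":
    n = divergence_matrix("ACGTACGTAC", "ACGAACTTGC")
    print(n, bowker_symmetry(n))
```

For four nucleotides, and provided every off-diagonal pair of cells is non-empty, the symmetry statistic has 6 degrees of freedom, the marginal statistic 3 and the internal statistic the remaining 3. Stuart's statistic requires an invertible covariance matrix, so very short or nearly identical sequences may raise the `ValueError` above.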
