Maximum Likelihood Phylogenetic Inference is Consistent on Multiple Sequence Alignments, with or without Gaps

We prove that maximum likelihood phylogenetic inference is consistent on gapped multiple sequence alignments (MSAs), provided that the substitution rate across each edge is greater than zero and under mild assumptions on the structure of the alignment. Under these assumptions, maximum likelihood asymptotically recovers the tree whose edge lengths correspond to the mean number of substitutions per site on each edge. This refutes Warnow's recent suggestion (Warnow 2012) that maximum likelihood phylogenetic inference might be statistically inconsistent when gaps are treated as missing data, even if the MSA is correct. We also derive a simple new proof of maximum likelihood consistency for ungapped alignments.
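As an informal illustration of what consistency with gaps treated as missing data means, the following minimal sketch simulates the simplest possible case: two taxa evolving under the Jukes-Cantor (JC69) model. It is not the proof given in the paper. The function names, the choice of JC69, and the assumption that gaps fall independently of the substitution process with a fixed per-sequence probability are all assumptions made for this example. With two taxa, a site with a gap in either sequence carries no information about the distance, so the maximum likelihood estimate depends only on the sites observed in both sequences.

```python
import math
import random


def simulate_pair(n_sites, d, gap_prob, rng):
    """Simulate a two-taxon JC69 alignment with independent random gaps.

    d is the true expected number of substitutions per site between the
    two sequences. gap_prob is a hypothetical per-sequence, per-site gap
    probability, assumed independent of the substitution process.
    Returns (sites observed in both sequences, differing sites among them).
    """
    # JC69 probability that the two sequences differ at a site.
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    observed = 0
    differing = 0
    for _ in range(n_sites):
        # A gap in either sequence makes the site uninformative about d,
        # which is what treating gaps as missing data amounts to here.
        if rng.random() < gap_prob or rng.random() < gap_prob:
            continue
        observed += 1
        if rng.random() < p_diff:
            differing += 1
    return observed, differing


def jc69_ml_distance(observed, differing):
    """Closed-form ML estimate of the JC69 distance from ungapped sites."""
    p = differing / observed
    if p >= 0.75:
        return math.inf  # saturated: the likelihood has no finite maximum
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    rng = random.Random(1)
    true_d = 0.3
    for n in (100, 1_000, 10_000, 100_000):
        obs, diff = simulate_pair(n, true_d, gap_prob=0.2, rng=rng)
        print(n, obs, jc69_ml_distance(obs, diff))
```

As the number of sites grows, the printed estimates should settle near the true distance of 0.3, even though roughly a third of the sites are discarded as gapped. This toy case says nothing about recovering tree topology, which is the harder question the paper addresses.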

[1] J. S. Rogers, et al. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Systematic Biology, 2001.

[2] Bhalchandra D. Thatte, et al. Invertibility of the TKF model of sequence evolution. Mathematical Biosciences, 2006.

[3] S. Tavaré. Some probabilistic and statistical problems in the analysis of DNA sequences. 1986.

[4] J. Felsenstein, et al. An evolutionary model for maximum likelihood alignment of DNA sequences. Journal of Molecular Evolution, 1991.

[5] M. Steel, et al. Twisted trees and inconsistency of tree estimation when gaps are treated as missing data - The impact of model mis-specification in distance corrections. Molecular Phylogenetics and Evolution, 2015.

[6] S. Jeffery. Evolution of Protein Molecules. 1979.

[7] A. Wald. Note on the Consistency of the Maximum Likelihood Estimate. 1949.

[8] H. Philippe, et al. Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Molecular Biology and Evolution, 2013.

[9] David J. C. MacKay, et al. Information Theory, Inference, and Learning Algorithms. IEEE Transactions on Information Theory, 2004.

[10] Joseph T. Chang, et al. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences, 1996.

[11] Michael I. Jordan, et al. Evolutionary inference via the Poisson Indel Process. Proceedings of the National Academy of Sciences, 2012.

[12] A. Roychoudhury. Consistency of the Maximum Likelihood Estimator of Evolutionary Tree. arXiv:1405.0760, 2014.

[13] Joseph Felsenstein, et al. Maximum Likelihood and Minimum-Steps Methods for Estimating Evolutionary Trees from Data on Discrete Characters. 1973.

[14] I. Holmes, et al. A "Long Indel" model for evolutionary sequence alignment. Molecular Biology and Evolution, 2003.

[15] D. Penny, et al. Missing Data and Influential Sites: Choice of Sites for Phylogenetic Analysis Can Be As Important As Taxon Sampling and Model Choice. Genome Biology and Evolution, 2013.

[16] M. Gil, et al. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biology, 2010.

[17] T. Jukes. CHAPTER 24 – Evolution of Protein Molecules. 1969.

[18] J. Farris. Likelihood and Inconsistency. Cladistics: The International Journal of the Willi Hennig Society, 1999.

[19] J. S. Rogers, et al. On the consistency of maximum likelihood estimation of phylogenetic trees from nucleotide sequences. Systematic Biology, 1997.

[20] Graham K. Rand, et al. Quantitative Applications in the Social Sciences. 1983.

[21] D. Altman, et al. Missing data. BMJ: British Medical Journal, 2007.

[22] Ziheng Yang. Statistical Properties of the Maximum Likelihood Method of Phylogenetic Estimation and Comparison With Distance Matrix Methods. 1994.

[23] T. Warnow. Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent. PLoS Currents, 2012.