Missing data in phylogenetic analysis: reconciling results from simulations and empirical data.

existing theoretical framework (Wiens 2003b). Furthermore, LEA did not address many contradictory studies suggesting that missing data are not generally problematic for Bayesian and likelihood analyses (given some assumptions). Second, the sweeping negative conclusions of LEA are not necessarily supported by their results. LEA find missing data to be problematic primarily when using sets of invariant or saturated characters and/or when obvious rate heterogeneity is ignored. Their results do not support the idea that missing data generally lead to incorrect inferences about topology when informative data are analyzed with appropriate methods. We conduct new simulations under more realistic conditions, and these results show no evidence that missing data generally lead to inaccurate Bayesian estimates of phylogeny. In fact, we show that the practice of excluding characters simply because they contain missing data cells may itself reduce accuracy. We reanalyze the “manipulated” empirical example from LEA and find that, without these artificial “manipulations” of the data, their conclusions are not supported. We also analyze eight empirical data sets, each containing many taxa with extensive missing data. We show that these incomplete taxa are consistently placed into the expected higher taxa, often with very strong support. Overall, our results confirm previous simulation and empirical studies showing that taxa with extensive missing data can be accurately placed in phylogenetic analyses and that adding characters with missing data can be beneficial (at least under some conditions). We conclude by pointing out important areas for future research on the topic of missing data and phylogenetic analysis.
