Missing data and the design of phylogenetic analyses

Concerns about the deleterious effects of missing data may often determine which characters and taxa are included in phylogenetic analyses. For example, researchers may exclude taxa lacking data for some genes or exclude a gene lacking data in some taxa. Yet, there may be very little evidence to support these decisions. In this paper, I review the effects of missing data on phylogenetic analyses. Recent simulations suggest that highly incomplete taxa can be accurately placed in phylogenies, as long as many characters have been sampled overall. Furthermore, adding incomplete taxa can dramatically improve results in some cases by subdividing misleading long branches. Adding characters with missing data can also improve accuracy, although there is a risk of long-branch attraction in some cases. Consideration of how missing data does (or does not) affect phylogenetic analyses may allow researchers to design studies that can reconstruct large phylogenies quickly, economically, and accurately.
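As a rough illustration of the design decision described above, the following sketch tallies how much of a character matrix is missing for each taxon before any taxon is excluded. It is not taken from the paper. The function names, the toy matrix, the set of symbols treated as missing (`?` and `-`), and the 0.5 threshold are all assumptions made for the example. The threshold is arbitrary and is not a recommended cutoff, since the simulations summarized above suggest that highly incomplete taxa can still be placed accurately when enough characters are sampled overall.

```python
# Hypothetical helper names and symbols; nothing here comes from the paper itself.
MISSING = frozenset("?-")  # assumption: '?' and '-' both encode missing entries


def missing_fraction(matrix):
    """Return {taxon: fraction of characters scored as missing}."""
    lengths = {len(seq) for seq in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all taxa must have the same number of characters")
    n_chars = lengths.pop()
    if n_chars == 0:
        raise ValueError("matrix has no characters")
    return {
        taxon: sum(ch in MISSING for ch in seq) / n_chars
        for taxon, seq in matrix.items()
    }


def split_by_completeness(matrix, max_missing=0.5):
    """Partition taxa into (kept, flagged) using an arbitrary threshold.

    Flagged taxa are only reported, not deleted, so an analysis can be run
    both with and without them and the results compared.
    """
    fractions = missing_fraction(matrix)
    kept = [t for t, f in fractions.items() if f <= max_missing]
    flagged = [t for t, f in fractions.items() if f > max_missing]
    return kept, flagged


# Toy matrix (invented data): three taxa, ten characters each.
toy_matrix = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGT??????",
    "taxon_C": "AC-TACGTAC",
}

if __name__ == "__main__":
    print(missing_fraction(toy_matrix))
    print(split_by_completeness(toy_matrix))
```

Reporting the flagged taxa instead of silently dropping them keeps the exclusion decision explicit, which makes it possible to test whether including incomplete taxa actually changes the inferred tree.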
