Phylogenomics with incomplete taxon coverage: the limits to inference

Background

Phylogenomic studies based on multi-locus sequence data sets are usually characterized by partial taxon coverage, in which sequences for some loci are missing for some taxa. The impact of missing data has been widely studied in phylogenetics, but it has proven difficult to distinguish effects due to error in tree reconstruction from effects due to missing data per se. We approach this problem using an explicitly phylogenomic criterion of success, decisiveness, which refers to whether the pattern of taxon coverage allows for uniquely defining a single tree for all taxa.

Results

We establish theoretical bounds on the impact of missing data on decisiveness. Results are derived for two contexts: a fixed taxon coverage pattern, such as that observed from an already assembled data set, and a randomly generated pattern derived from a process of sampling new data, such as might be observed in an ongoing comparative genomics sequencing project. Lower bounds on how many loci are needed for decisiveness are derived for the former case, and both lower and upper bounds for the latter. When data are not decisive for all trees, we estimate the probability of decisiveness and the chances that a given edge in the tree will be distinguishable. Theoretical results are illustrated using several empirical examples constructed by mining sequence databases, genomic libraries such as ESTs and BACs, and complete genome sequences.

Conclusion

Partial taxon coverage among loci can limit phylogenomic inference by making it impossible to distinguish among multiple alternative trees. However, even though lack of decisiveness is typical of many sparse phylogenomic data sets, it is often still possible to distinguish a large fraction of edges in the tree.
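For small data sets, decisiveness of a fixed taxon coverage pattern can be checked by brute force. The sketch below is illustrative only and is not taken from the paper. It assumes the four-way partition characterization of decisiveness for unrooted binary trees attributed to Steel and Sanderson (2010): a coverage pattern is decisive for all trees if and only if, for every partition of the taxa into four nonempty blocks, at least one locus contains a taxon from each block. The function name `is_decisive` and the representation of each locus as a set of taxon names are assumptions made for this example. The running time is exponential in the number of taxa, so it is only practical for a handful of taxa.

```python
from itertools import product


def is_decisive(taxa, loci):
    """Brute-force four-way partition check (illustrative, exponential time).

    taxa: iterable of taxon names.
    loci: iterable of collections; each lists the taxa sequenced for one locus.
    Returns True if every partition of the taxa into four nonempty blocks
    has at least one locus that contains a taxon from each block.
    """
    taxa = list(taxa)
    loci = [set(locus) for locus in loci]

    # Enumerate assignments of taxa to four block labels.
    for labels in product(range(4), repeat=len(taxa)):
        # Block labels are interchangeable, so pin the first taxon to
        # block 0 to skip some redundant relabelings.
        if labels[0] != 0:
            continue
        # Keep only assignments that use all four blocks.
        if len(set(labels)) < 4:
            continue

        blocks = [set() for _ in range(4)]
        for taxon, label in zip(taxa, labels):
            blocks[label].add(taxon)

        # Some locus must intersect every block of this partition.
        covered = any(all(locus & block for block in blocks) for locus in loci)
        if not covered:
            return False
    return True


if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]
    # Taxa "a" and "e" never occur together in a locus.
    sparse = [{"a", "b", "c", "d"}, {"b", "c", "d", "e"}]
    print(is_decisive(taxa, sparse))
    print(is_decisive(taxa, [set(taxa)]))
```

In the `sparse` pattern, the partition {a}, {e}, {b}, {c, d} is not met by either locus, because no locus contains both a and e, so the check fails. A single locus covering every taxon trivially passes.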
