Obtaining maximal concatenated phylogenetic data sets from large sequence databases.

To improve the accuracy of tree reconstruction, phylogeneticists are extracting increasingly large multigene data sets from sequence databases. Determining whether a database contains at least k genes, each sampled from the same set of at least m species, is an NP-complete problem. However, the skewed distribution of sequences in these databases permits all such data sets to be obtained in reasonable computing times even for large numbers of sequences. We developed an exact algorithm for obtaining the largest multigene data sets from a collection of sequences. We tested the algorithm on a set of 100,000 protein sequences of green plants and used it to identify the largest multigene ortholog data sets having at least 3 genes and 6 species. The distribution of sizes of these data sets forms a hollow curve, and the largest are surprisingly small, ranging from 62 genes by 6 species to 3 genes by 65 species, with more symmetrical data sets of around 15 taxa by 15 genes. These upper bounds on sequence concatenation have important implications for building the tree of life from large sequence databases.
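
The search described above can be read as enumerating maximal bicliques in a bipartite graph of genes and species, where a gene is linked to every species in which it has been sampled. The following is a minimal sketch of that idea, not the algorithm from the paper: the function name `maximal_datasets`, the input layout (a dict from gene name to a set of species), and the toy data are all assumptions made for illustration. It builds every distinct intersection of per-gene species sets, discards those below the species threshold, and then collects the genes that cover each remaining species set. The worst case is exponential, as expected for an NP-complete problem.

```python
def maximal_datasets(sampled, min_genes, min_species):
    """Enumerate maximal (genes, species) blocks with no missing sequences.

    sampled: dict mapping a gene name to the set of species it is sampled in
             (hypothetical input layout, assumed for this sketch).
    Returns a set of (frozenset_of_genes, frozenset_of_species) pairs in which
    every gene is sampled in every species and neither side can be extended.
    """
    # Collect all distinct intersections of per-gene species sets.
    # Intersections only shrink, so sets below min_species can be dropped early.
    closed = set()
    for species in sampled.values():
        row = frozenset(species)
        candidates = {row} | {row & other for other in closed}
        closed |= {c for c in candidates if len(c) >= min_species}

    # For each candidate species set, gather every gene that covers all of it.
    results = set()
    for species_set in closed:
        genes = frozenset(g for g, sp in sampled.items() if species_set <= sp)
        if len(genes) >= min_genes:
            results.add((genes, species_set))
    return results


# Toy example with made-up sampling: three genes, three species.
toy = {
    "rbcL": {"A", "B", "C"},
    "atpB": {"A", "B"},
    "matK": {"B", "C"},
}
for genes, species in sorted(maximal_datasets(toy, 2, 2), key=lambda p: sorted(p[1])):
    print(sorted(genes), sorted(species))
```

On real data the thresholds would correspond to the abstract's k and m (for example 3 genes and 6 species). This sketch makes no attempt to exploit the skewed distribution of sequences that the authors rely on for practical running times.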
