Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes.

A comprehensive phylogeny of papilionoid legumes was inferred from sequences of 2228 taxa in GenBank release 147. A semiautomated analysis pipeline was constructed to download, parse, assemble, align, combine, and build trees from a pool of 11,881 sequences. Initial steps included all-against-all BLAST similarity searches coupled with assembly, using a novel strategy for building length-homogeneous primary sequence clusters. This was followed by a combination of global and local alignment protocols to build larger secondary clusters of locally aligned sequences, thereby accounting for the dramatic differences in length among the heterogeneous coding and noncoding sequence data present in GenBank. Next, clusters were checked for the presence of duplicate genes and other potentially misleading sequences and examined for combinability with other clusters on the basis of taxon overlap. Finally, two supermatrices were constructed: a "sparse" matrix based on the primary clusters alone (1794 taxa × 53,977 characters), and a somewhat more "dense" matrix based on the secondary clusters (2228 taxa × 33,168 characters). Both matrices were very sparse, with 95% of their cells containing gaps or question marks. Both were subjected to extensive parsimony analyses using deterministic and stochastic heuristics, including bootstrap analyses. A "reduced consensus" bootstrap analysis was also performed to detect cryptic signal in a subtree of the data set corresponding to a "backbone" phylogeny proposed in previous studies. Overall, the dense supermatrix appeared to provide considerably more satisfactory results, as indicated by better resolution of the bootstrap tree, excellent agreement with the backbone papilionoid tree in the reduced bootstrap consensus analysis, fewer problematic large polytomies in the strict consensus, and less fragmentation of conventionally recognized genera. Nevertheless, at lower taxonomic levels several problems were identified and diagnosed.
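To make the supermatrix construction and the reported sparseness concrete, the following is a minimal Python sketch, not the pipeline used in the study. It assumes each cluster is already aligned and is given as a mapping from taxon name to sequence; the function names `build_supermatrix` and `missing_fraction` and the toy taxa and sequences are hypothetical. Taxa absent from a cluster are padded with question marks, and the proportion of cells holding gaps or question marks is then computed.

```python
def build_supermatrix(clusters):
    """Concatenate aligned clusters into one taxon-by-character matrix.

    clusters: list of dicts mapping taxon name -> aligned sequence.
    All sequences within one cluster must have the same length.
    Taxa missing from a cluster are filled with '?' for that block.
    """
    taxa = sorted({taxon for cluster in clusters for taxon in cluster})
    rows = {taxon: [] for taxon in taxa}
    for cluster in clusters:
        lengths = {len(seq) for seq in cluster.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in a cluster must be aligned to equal length")
        width = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(cluster.get(taxon, "?" * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


def missing_fraction(matrix):
    """Return the proportion of cells that are gaps ('-') or unknown ('?')."""
    cells = "".join(matrix.values())
    return sum(ch in "-?" for ch in cells) / len(cells)


# Toy input (illustrative only): two aligned clusters sharing one taxon.
toy_clusters = [
    {"Astragalus": "ACGT", "Glycine": "AC-T"},
    {"Glycine": "GG", "Lotus": "GA"},
]
matrix = build_supermatrix(toy_clusters)
for taxon, row in matrix.items():
    print(f"{taxon:<12}{row}")
print(f"missing fraction: {missing_fraction(matrix):.2f}")
```

Even in this three-taxon toy case, limited taxon overlap between the two blocks leaves a large share of cells empty, which is the same effect that produces the 95% figure at full scale.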
A large number of methodological issues in supermatrix construction at this scale are discussed, including detection of annotation errors in GenBank sequences; the shortage of effective algorithms and software for local multiple sequence alignment; the difficulty of overcoming effects of fragmentation of data into nearly disjoint blocks in sparse supermatrices; and the lack of informative tools to assess confidence limits in very large trees.
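The fragmentation problem mentioned above can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the authors' procedure: clusters are treated as nodes, two clusters are linked when they share at least `min_shared` taxa (a hypothetical threshold), and the connected groups are reported. The function name `overlap_components`, the cluster labels, and the taxon labels are all invented for the example.

```python
def overlap_components(cluster_taxa, min_shared=1):
    """Group clusters into blocks connected through shared taxa.

    cluster_taxa: dict mapping cluster name -> set of taxon names.
    Two clusters are linked if they share at least `min_shared` taxa.
    Returns a sorted list of sorted lists of cluster names.
    """
    names = list(cluster_taxa)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if len(cluster_taxa[a] & cluster_taxa[b]) >= min_shared:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy input (illustrative only): the third cluster shares no taxa with the others.
toy = {
    "cluster_1": {"A", "B", "C"},
    "cluster_2": {"C", "D"},
    "cluster_3": {"E", "F"},
}
blocks = overlap_components(toy)
strict_blocks = overlap_components(toy, min_shared=2)
print(blocks)
print(strict_blocks)
```

A cluster that ends up in its own block contributes no information about how its taxa relate to the rest of the matrix, so raising the overlap threshold trades matrix size against connectedness.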
