The PhyLoTA Browser: processing GenBank for molecular phylogenetics research.

As an archive of sequence data for over 165,000 species, GenBank is an indispensable resource for phylogenetic inference. Here we describe an informatics processing pipeline and online database, the PhyLoTA Browser (http://loco.biosci.arizona.edu/pb), which offers a view of GenBank tailored for molecular phylogenetics. The first release of the Browser is computed from 2.6 million sequences representing the taxonomically enriched subset of GenBank sequences for eukaryotes (excluding most genome survey sequences, ESTs, and other high-throughput data). In addition to summarizing sequence diversity and species diversity across nodes in the NCBI taxonomy, it reports 87,000 potentially phylogenetically informative clusters of homologous sequences, which can be viewed or downloaded along with provisional alignments and coarse phylogenetic trees. At each node in the NCBI hierarchy, the user can display a subtaxa-by-clusters "data availability matrix" in which each entry reports the sequences available for that subtaxon in that cluster. This matrix provides a guidepost for subsequent assembly of multigene data sets or supertrees. The database also supports comparison with results computed from previous GenBank releases, highlighting recent additions of sequences or taxa to GenBank and letting investigators track progress in data availability worldwide. Although the reported alignments and trees are extremely approximate, the database reports several statistics correlated with alignment quality to help users choose among alternative data sources.
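
To make the idea of a subtaxa-by-clusters data availability matrix concrete, the following is a minimal sketch in Python. It is not the PhyLoTA Browser's implementation or schema: the function name `availability_matrix`, the record layout (subtaxon, cluster identifier, sequence identifier), and the sample records are all hypothetical and chosen only for illustration. Each entry counts the sequences available for one subtaxon in one cluster.

```python
from collections import Counter


def availability_matrix(records):
    """Build a subtaxa-by-clusters matrix of sequence counts.

    `records` is an iterable of (subtaxon, cluster_id, sequence_id) tuples.
    This layout is an assumption made for illustration, not PhyLoTA's schema.
    Returns (row_labels, column_labels, matrix), where matrix[i][j] is the
    number of sequences for subtaxon i in cluster j.
    """
    records = list(records)
    # Count sequences per (subtaxon, cluster) pair.
    counts = Counter((taxon, cluster) for taxon, cluster, _seq in records)
    # Sorted labels give a stable row and column order.
    rows = sorted({taxon for taxon, _cluster, _seq in records})
    cols = sorted({cluster for _taxon, cluster, _seq in records})
    matrix = [[counts.get((t, c), 0) for c in cols] for t in rows]
    return rows, cols, matrix


# Hypothetical records: subtaxon, cluster identifier, sequence identifier.
example_records = [
    ("Astragalus", "cluster_A", "seq1"),
    ("Astragalus", "cluster_A", "seq2"),
    ("Astragalus", "cluster_B", "seq3"),
    ("Medicago", "cluster_B", "seq4"),
]

rows, cols, matrix = availability_matrix(example_records)
for taxon, row in zip(rows, matrix):
    print(taxon, row)
```

A zero entry marks a subtaxon with no sequence in that cluster, which is the kind of gap an investigator would weigh when assembling a multigene data set or supertree.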
