CONSTRUCTING THE TREEFAM DATABASE

TreeFam is a database of phylogenetic trees of gene families. It aims to provide a curated resource that presents the accurate evolutionary history of all animal gene families, together with reliable ortholog and paralog assignments. In developing TreeFam, four novel algorithms were designed to improve the accuracy of tree building or to serve specific development needs. The first is a constrained neighbour-joining algorithm that efficiently adds new sequences to an existing tree while preserving the original topology. It is used to expand a seed tree to a full tree without losing any information added by manual curation. The second is a leaf-reordering algorithm that orders the leaves of a tree according to leaf weights. A single tree can be drawn in different ways depending on the order of its leaves; this algorithm displays trees in a consistent manner and facilitates visual examination, which is particularly helpful when comparing two trees. Third, duplication and loss inference is fitted into a more general theoretical framework and extended to allow for a multifurcating species tree. The fourth is a new algorithm for merging trees. The tree merge algorithm is not itself a tree-building algorithm; rather, it reconstructs an optimal tree from several trees built from an identical sequence set with different tree-building methods. The resulting tree should combine the advantages of all the candidates and so outperform them. A large-scale benchmark presented in the last chapter shows that it does. This benchmark is one of the few evaluations in the phylogenetics literature that are based on real data. It also highlights that each tree-building algorithm has its own strengths, although maximum likelihood (ML) and parsimony methods are slightly better in general.
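The idea behind leaf reordering can be illustrated with a minimal sketch. It is not the TreeFam implementation: the nested-tuple tree representation, the function names `reorder` and `leaves`, the example weights, and the rule of ranking each subtree by its smallest leaf weight are all assumptions made for illustration. The abstract states only that leaves are ordered according to their weights.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical representation: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def reorder(tree: Tree, weights: Dict[str, float]) -> Tuple[Tree, float]:
    """Return (reordered_tree, key), where key is the smallest leaf weight in the subtree.

    Assumption: a subtree is ranked by its lightest leaf. Other aggregation
    rules (mean, sum) would work equally well for this illustration.
    """
    if isinstance(tree, str):
        return tree, weights[tree]
    # Reorder every child first, then sort the children by their keys.
    ordered = sorted((reorder(child, weights) for child in tree),
                     key=lambda pair: pair[1])
    return tuple(sub for sub, _ in ordered), ordered[0][1]


def leaves(tree: Tree) -> List[str]:
    """List the leaves in left-to-right drawing order."""
    if isinstance(tree, str):
        return [tree]
    out: List[str] = []
    for child in tree:
        out.extend(leaves(child))
    return out


# Two drawings of the same topology that differ only in leaf order.
tree_a: Tree = (("mouse", "human"), ("fly", "worm"))
tree_b: Tree = (("worm", "fly"), ("human", "mouse"))
weights = {"human": 0, "mouse": 1, "fly": 2, "worm": 3}

canonical_a, _ = reorder(tree_a, weights)
canonical_b, _ = reorder(tree_b, weights)
print(leaves(canonical_a))
print(canonical_a == canonical_b)
```

Because swapping the children of a node changes only the drawing and not the topology, both inputs reduce to the same layout once every node's children are sorted, which is what makes side-by-side comparison of two trees easier.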