GIGA: a simple, efficient algorithm for gene tree inference in the genomic age

Background: Phylogenetic relationships between genes are not only of theoretical interest: they enable us to learn about human genes through the experimental work on their relatives in numerous model organisms from bacteria to fruit flies and mice. Yet the most commonly used computational algorithms for reconstructing gene trees can be inaccurate for numerous reasons, both algorithmic and biological. Additional information beyond gene sequence data has been shown to improve the accuracy of reconstructions, though at great computational cost.

Results: We describe a simple, fast algorithm for inferring gene phylogenies, which makes use of information that was not available prior to the genomic age: namely, a reliable species tree spanning much of the tree of life, and knowledge of the complete complement of genes in a species' genome. The algorithm, called GIGA, constructs trees agglomeratively from a distance matrix representation of sequences, using simple rules to incorporate this genomic age information. GIGA makes use of a novel conceptualization of gene trees as being composed of orthologous subtrees (containing only speciation events), which are joined by other evolutionary events such as gene duplication or horizontal gene transfer. An important innovation in GIGA is that, at every step in the agglomeration process, the tree is interpreted/reinterpreted in terms of the evolutionary events that created it. Remarkably, GIGA performs well even when using a very simple distance metric (pairwise sequence differences) and no distance averaging over clades during the tree construction process.

Conclusions: GIGA is efficient, allowing phylogenetic reconstruction of very large gene families and determination of orthologs on a large scale. It is exceptionally robust to adding more gene sequences, opening up the possibility of creating stable identifiers for referring to not only extant genes, but also their common ancestors.
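To make the "very simple distance metric (pairwise sequence differences)" mentioned in the Results concrete, the following Python sketch computes the fraction of differing positions between pre-aligned sequences and collects the values into a distance matrix. It is an illustration only, not GIGA's implementation: the function names, the decision to skip gapped columns, and the toy sequences are all assumptions.

```python
from itertools import combinations


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Fraction of compared positions at which two aligned sequences differ.

    Assumption: columns where either sequence has a gap are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # ignore gapped columns
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differing / compared


def distance_matrix(seqs: dict) -> dict:
    """Map each unordered pair of sequence names to its p-distance."""
    return {
        (i, j): p_distance(seqs[i], seqs[j])
        for i, j in combinations(sorted(seqs), 2)
    }


# Hypothetical toy alignment, for illustration only.
toy = {"geneA": "ACGT", "geneB": "ACGA", "geneC": "AC-A"}
matrix = distance_matrix(toy)
```

A matrix of this kind is the sort of input an agglomerative method starts from; how GIGA then combines it with the species tree and genome content is described in the paper itself and is not reproduced here.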
We compared trees produced by GIGA to those in the TreeFam database and found them generally very similar, with most differences likely due to poor alignment quality. However, some of the remaining differences are algorithmic, and can be explained by the fact that GIGA tends to place greater emphasis on minimizing gene duplication and deletion events.
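The idea of interpreting each join in terms of the evolutionary event that created it, and of counting duplications, can be sketched with a common species-overlap heuristic. This is a simplification and not GIGA's actual rule set, which also consults the species tree: here a join is labelled a duplication when the two subtrees share a species and a speciation otherwise. The `Node` class, the helper names, and the gene and species labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    species: frozenset      # species represented below this node
    event: str = "leaf"     # "leaf", "speciation", or "duplication"
    children: tuple = ()
    name: str = ""


def leaf(gene: str, species: str) -> Node:
    """Create a leaf for one gene from one species."""
    return Node(frozenset([species]), "leaf", (), gene)


def join(left: Node, right: Node) -> Node:
    """Join two subtrees and label the new node by species overlap.

    Assumption: shared species implies duplication; disjoint species
    sets imply speciation. The species tree is ignored in this sketch.
    """
    event = "duplication" if left.species & right.species else "speciation"
    return Node(left.species | right.species, event, (left, right))


def count_events(node: Node, event: str) -> int:
    """Count nodes in the subtree carrying the given event label."""
    own = 1 if node.event == event else 0
    return own + sum(count_events(c, event) for c in node.children)


# Hypothetical two-paralog family in two species.
a = join(leaf("g1_human", "human"), leaf("g1_mouse", "mouse"))
b = join(leaf("g2_human", "human"), leaf("g2_mouse", "mouse"))
root = join(a, b)
```

In this toy family the two inner joins are orthologous subtrees containing only speciation events, and the root that connects them is labelled a duplication, mirroring the decomposition described in the abstract.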
