A simple algorithm to infer gene duplication and speciation events on a gene tree

MOTIVATION: When analyzing protein sequences using sequence similarity searches, orthologous sequences (which diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (which diverged by gene duplication), because duplication enables functional diversification. The utility of phylogenetic information in high-throughput genome annotation ('phylogenomics') is widely recognized, but existing approaches are either manual or indirect (e.g. not based on phylogenetic trees). Our goal is to automate phylogenomics using explicit phylogenetic inference. A necessary component is an algorithm to infer speciation and duplication events in a given gene tree.

RESULTS: We give an algorithm to infer speciation and duplication events on a gene tree by comparison to a trusted species tree. This algorithm has a worst-case running time of O(n^2), which is inferior to two previous algorithms that are approximately O(n) for a gene tree of n sequences. However, our algorithm is extremely simple, and its asymptotic worst-case behavior is only realized on pathological data sets. We show empirically, using 1750 gene trees constructed from the Pfam protein family database, that it appears to be a practical (and often superior) algorithm for analyzing real gene trees.

AVAILABILITY: http://www.genetics.wustl.edu/eddy/forester.
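To make the idea of comparing a gene tree to a trusted species tree concrete, here is a minimal sketch in Python. It is not the authors' implementation, and the abstract does not spell out the procedure. The sketch assumes the commonly used mapping approach: each gene tree node is mapped to the lowest common ancestor (in the species tree) of its children's mappings, and a node is called a duplication when it maps to the same species tree node as one of its children. The `Node` class, the helper names, and the `gene_species` leaf naming convention are all hypothetical, and both trees are assumed to be rooted and binary.

```python
class Node:
    """Minimal rooted tree node (hypothetical helper, not from the paper)."""
    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self
        self.depth = 0        # used on species tree nodes only
        self.mapping = None   # species tree node a gene tree node maps to
        self.event = None     # "speciation" or "duplication" (internal nodes)

def index_species_tree(root):
    """Assign depths and return a dict: species name -> species leaf."""
    leaves = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            leaves[node.name] = node
        for child in node.children:
            child.depth = node.depth + 1
            stack.append(child)
    return leaves

def lca(a, b):
    """Lowest common ancestor found by climbing parent pointers."""
    while a is not b:
        if a.depth >= b.depth:
            a = a.parent
        else:
            b = b.parent
    return a

def infer_events(gene_root, species_leaves, species_of):
    """Postorder pass over a binary gene tree labelling internal nodes."""
    def visit(g):
        if not g.children:
            g.mapping = species_leaves[species_of(g.name)]
            return
        for child in g.children:
            visit(child)
        left, right = g.children  # assumes a strictly binary gene tree
        g.mapping = lca(left.mapping, right.mapping)
        if g.mapping is left.mapping or g.mapping is right.mapping:
            g.event = "duplication"
        else:
            g.event = "speciation"
    visit(gene_root)

# Toy species tree: ((human, mouse), fly)
species_root = Node("root", [Node("hm", [Node("human"), Node("mouse")]),
                             Node("fly")])
species_leaves = index_species_tree(species_root)

# Toy gene tree: (((g1_human, g1_mouse), g2_human), g3_fly)
inner1 = Node("inner1", [Node("g1_human"), Node("g1_mouse")])
inner2 = Node("inner2", [inner1, Node("g2_human")])
gene_root = Node("gene_root", [inner2, Node("g3_fly")])

# Assumed naming convention: the species is the text after the last "_".
infer_events(gene_root, species_leaves, lambda name: name.rsplit("_", 1)[1])
events = {n.name: n.event for n in (inner1, inner2, gene_root)}
```

In this toy case `inner2` joins a human/mouse clade with a second human gene, so it maps to the same species tree node as its child `inner1` and is labelled a duplication, while the other two internal nodes are labelled speciations. Because `lca` climbs one parent at a time, each call can take time proportional to the species tree height, so this naive sketch is quadratic in the worst case on very unbalanced trees.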
