Computational methods for gene orthology inference

Accurate inference of orthologous genes is a prerequisite for most comparative genomics studies and is also important for the functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods rely on comparing an individual gene tree with a species tree. Once both trees have been constructed accurately, orthologs follow directly from the definition of orthology: they are the homologs related by speciation, rather than by gene duplication, at their most recent point of origin. Although ideal for orthology identification in principle, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple 'tree-like' mode, which makes conventional tree reconciliation inapplicable. Alternative, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, and efficient implementations based on graph-theoretical algorithms enable comparisons of thousands of genomes. Comparisons of the two approaches show that, despite their conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny can also aid in the identification of orthologs. Tree-based, sequence similarity-based, and synteny-based approaches can often be combined into flexible hybrid methods.
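
To make the heuristic idea of "closest homologous pairs" concrete, the following is a minimal sketch of one common variant, reciprocal best hits between two genomes. It assumes that pairwise similarity scores (for example, bit scores from a sequence search) have already been computed; the gene names, the scores, and the function names `best_hits` and `reciprocal_best_hits` are hypothetical and serve only as illustration. It is not the algorithm of any particular orthology tool.

```python
from typing import Dict, List, Tuple

# Hypothetical input: similarity scores keyed by (gene in genome A, gene in genome B).
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores, query_index: int) -> Dict[str, str]:
    """Map each query gene to its highest-scoring partner in the other genome.

    query_index is 0 to treat genome A genes as queries, 1 for genome B genes.
    On ties, the pair encountered first is kept.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for pair, score in scores.items():
        query, target = pair[query_index], pair[1 - query_index]
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(scores: Scores) -> List[Tuple[str, str]]:
    """Return (A gene, B gene) pairs that are each other's best hit."""
    a_to_b = best_hits(scores, 0)
    b_to_a = best_hits(scores, 1)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Toy data (invented): a3 hits b2 best, but b2 prefers a2, so a3 gets no ortholog.
toy_scores: Scores = {
    ("a1", "b1"): 250.0,
    ("a1", "b2"): 90.0,
    ("a2", "b2"): 180.0,
    ("a2", "b1"): 60.0,
    ("a3", "b2"): 120.0,
}

pairs = reciprocal_best_hits(toy_scores)
print(pairs)
```

In this toy case `a3` is left unpaired: its best hit `b2` has a stronger match in `a2`. A pairwise criterion of this kind cannot by itself distinguish a recent paralog from a gene whose ortholog was lost, which is one reason such heuristics are often combined with tree-based or synteny-based evidence.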
