OrthologID: automation of genome-scale ortholog identification within a parsimony framework

MOTIVATION: The determination of gene orthology is a prerequisite for mining and utilizing the rapidly increasing amount of sequence data for genome-scale phylogenetics and comparative genomic studies. Until now, most researchers have used pairwise distance comparison methods, such as BLAST, COG, RBH, RSD and INPARANOID, to determine gene orthology. In contrast, orthology determination within a character-based phylogenetic framework has not been applied on a genomic scale, because the procedure has lacked efficiency and automation.

RESULTS: We have developed OrthologID, a Web application that automates the labor-intensive procedures of gene orthology determination within a character-based phylogenetic framework, making character-based orthology determination possible on a genomic scale. In addition to generating gene family trees and determining orthologous gene sets for complete genomes, OrthologID can identify the diagnostic characters that define each orthologous gene set, as well as the diagnostic characters responsible for classifying query sequences from other genomes into specific orthology groups. The OrthologID database currently contains several complete plant genomes, including Arabidopsis thaliana, Oryza sativa and Populus trichocarpa, as well as a unicellular outgroup, Chlamydomonas reinhardtii. To extend the utility of OrthologID beyond plant species, we plan to expand the sequence database to include the fully sequenced genomes of prokaryotes and other non-plant eukaryotes.

AVAILABILITY: http://nypg.bio.nyu.edu/orthologid/
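
To make the notion of a diagnostic character concrete, the following is a minimal sketch and not OrthologID's actual procedure. It assumes a simplified definition: an alignment column is diagnostic for a group if every group member shares one state that no sequence outside the group carries. The function name `diagnostic_characters`, the sequence labels and the toy alignment are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def diagnostic_characters(
    alignment: Dict[str, str], group: Set[str]
) -> List[Tuple[int, str]]:
    """Return (column, state) pairs that are fixed in `group` and absent elsewhere.

    Simplified, assumed definition: a column is diagnostic when all group
    members share a single non-gap state that no other sequence has.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")

    inside = [seq for name, seq in alignment.items() if name in group]
    outside = [seq for name, seq in alignment.items() if name not in group]
    if not inside or not outside:
        raise ValueError("need at least one sequence inside and outside the group")

    result = []
    for col in range(lengths.pop()):
        in_states = {seq[col] for seq in inside}
        out_states = {seq[col] for seq in outside}
        if len(in_states) == 1:
            state = next(iter(in_states))
            # Gaps are not treated as diagnostic states in this sketch.
            if state != "-" and state not in out_states:
                result.append((col, state))
    return result


# Hypothetical toy alignment of one gene family with two candidate ortholog sets.
toy_alignment = {
    "ath_1": "MKTLA",
    "osa_1": "MKTLA",
    "ath_2": "MRTLG",
    "osa_2": "MRSLG",
}

print(diagnostic_characters(toy_alignment, {"ath_1", "osa_1"}))
```

In this toy case, columns 1 and 4 (0-based) separate the first pair of sequences from the second. A query sequence carrying the same states at those columns would be assigned to that group under this simplified rule.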
