Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits

Correct orthology assignment is a critical prerequisite of numerous comparative genomics procedures, such as function prediction, construction of phylogenetic species trees and genome rearrangement analysis. We present an algorithm for detecting non-orthologs that are erroneously grouped together by current orthology classification methods based on genome-specific best hits, such as the COGs database. The algorithm works with pairwise distance estimates rather than with tree-building methods, which are computationally expensive and error-prone. Its accuracy is evaluated by verifying the distribution of predicted cases, by case-by-case phylogenetic analysis and by comparison with predictions from other projects that use independent methods. Our results show that a substantial fraction of the COG groups include non-orthologs: using conservative parameters, the algorithm detects non-orthology in a third of all COG groups. Consequently, sequence analyses that depend on correct orthology assignments will benefit greatly from these findings.
