Paper information - Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits


Correct orthology assignment is a critical prerequisite of numerous comparative genomics procedures, such as function prediction, construction of phylogenetic species trees and genome rearrangement analysis. We present an algorithm for the detection of non-orthologs that arise by mistake in current orthology classification methods based on genome-specific best hits, such as the COGs database. The algorithm works with pairwise distance estimates, rather than computationally expensive and error-prone tree-building methods. The accuracy of the algorithm is evaluated through verification of the distribution of predicted cases, case-by-case phylogenetic analysis and comparisons with predictions from other projects using independent methods. Our results show that a very significant fraction of the COG groups include non-orthologs: using conservative parameters, the algorithm detects non-orthology in a third of all COG groups. Consequently, sequence analysis sensitive to correct orthology assignments will greatly benefit from these findings.
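To make the term "genome-specific best hits" concrete, the following is a minimal sketch of how reciprocal best hits between two genomes could be derived from a table of pairwise distance estimates. It is an illustration only: it is not the authors' non-orthology detection algorithm, nor the COG construction procedure, and the function names (`best_hits`, `reciprocal_best_hits`) and the toy distances are hypothetical.

```python
def best_hits(distances):
    """Map each query gene to its closest target gene (smallest distance).

    `distances` is a dict keyed by (query_gene, target_gene) with a
    numeric distance estimate as value (assumed input format).
    """
    best = {}
    for (query, target), d in distances.items():
        if query not in best or d < best[query][1]:
            best[query] = (target, d)
    return {q: t for q, (t, _) in best.items()}


def reciprocal_best_hits(dist_ab):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(dist_ab)
    # Swap the key order to search from genome B back into genome A.
    dist_ba = {(b, a): d for (a, b), d in dist_ab.items()}
    backward = best_hits(dist_ba)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy distances between genes of genome A (a1..a3) and genome B (b1, b2).
toy = {
    ("a1", "b1"): 10.0,
    ("a1", "b2"): 35.0,
    ("a2", "b1"): 30.0,
    ("a2", "b2"): 12.0,
    ("a3", "b2"): 20.0,
}
pairs = reciprocal_best_hits(toy)
print(pairs)
```

In this toy input, `a3` has `b2` as its best hit, but `b2` is closer to `a2`, so `a3` is left unpaired. Best-hit relations of this kind can also link genes that are not true orthologs, for example after differential gene loss, which is the type of error the abstract is concerned with.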

Alexander C. J. Roth | G. Gonnet | B. Boeckmann | C. Dessimoz
