A genomic perspective on protein families.

To extract the maximum amount of information from the rapidly accumulating genome sequences, all conserved genes need to be classified according to their homologous relationships. Comparing the proteins encoded in seven complete genomes from five major phylogenetic lineages, and identifying consistent patterns of sequence similarity among them, allowed the delineation of 720 clusters of orthologous groups (COGs). Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, so functional information can be transferred from one member to an entire COG. This relationship automatically yields a number of functional predictions for poorly characterized genomes. The COGs provide a framework for functional and evolutionary genome analysis.
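The two rules stated above (a COG must span at least three lineages, and a function known for one member can be proposed for the others) can be illustrated with a minimal sketch. It assumes the candidate clusters have already been built by some similarity-based procedure that is not shown here; the function names, protein identifiers, lineage labels, and the function label are all hypothetical toy values, not data from the study.

```python
def filter_cogs(clusters, lineage_of, min_lineages=3):
    """Keep only clusters whose members come from at least `min_lineages` lineages."""
    kept = {}
    for cog_id, proteins in clusters.items():
        lineages = {lineage_of[p] for p in proteins}
        if len(lineages) >= min_lineages:
            kept[cog_id] = proteins
    return kept


def transfer_annotations(cogs, known_function):
    """For each unannotated protein, propose the functions known for other members of its COG."""
    predictions = {}
    for proteins in cogs.values():
        functions = sorted({known_function[p] for p in proteins if p in known_function})
        if not functions:
            continue  # nothing to transfer within this COG
        for p in proteins:
            if p not in known_function:
                predictions[p] = functions
    return predictions


# Toy input (hypothetical identifiers and labels).
clusters = {
    "COG_A": ["ec_1", "hi_1", "mj_1", "sc_1"],
    "COG_B": ["ec_2", "hi_2"],  # one lineage only, so it is dropped
}
lineage_of = {
    "ec_1": "gram_negative", "hi_1": "gram_negative",
    "mj_1": "archaea", "sc_1": "eukaryote",
    "ec_2": "gram_negative", "hi_2": "gram_negative",
}
known_function = {"ec_1": "example function X"}

cogs = filter_cogs(clusters, lineage_of)
predictions = transfer_annotations(cogs, known_function)
print(predictions)
```

Such transferred annotations are predictions rather than established functions, since orthologs typically, but not always, share a function.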
