Pervasiveness of Gene Conservation and Persistence of Duplicates in Cellular Genomes

Abstract. We present detailed statistics on ancestral gene duplication and gene conservation in completely sequenced cellular genomes. Analysis of open reading frame (ORF) products having simultaneous matches in several distinct organisms showed a significant correlation between duplication and conservation. Systematic comparisons of the predicted proteomes of 23 organisms (including 20 that have been completely sequenced) allowed us to quantify the degree of ancestral duplication within each genome and the level of conservation between genomes, using threshold values calculated for individual organisms. Statistical analysis of various gene proportions revealed several trends in gene structure and evolution: (a) more than one-quarter (25%–66%) of the predicted ORF products of the surveyed organisms are not unique, indicating a high level of ancestral duplication; (b) levels of exclusive conservation within Bacteria are higher than those within the eukaryal or archaeal domains; and (c) about one-half or more (47%–99%) of the total predicted ORF products in the surveyed genomes have one or several highly significant matches in another genome. Matches are deemed significant on the basis of simulations that take into account the mean size of ORF products and the composition of each target organism's proteome. The methodology we have developed keeps our results stable and comparable as the number of completely sequenced genomes increases.
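
As an illustration of the kind of proportion reported above, the following minimal sketch computes the fraction of ORF products whose best match score reaches an organism-specific threshold, once for matches within the same genome (duplication) and once for matches in another genome (conservation). The function name, the toy scores, and the threshold value are all hypothetical; the abstract does not specify the scoring scheme or the actual thresholds, so this is not the authors' procedure.

```python
from typing import Dict


def fraction_above_threshold(best_scores: Dict[str, float], threshold: float) -> float:
    """Return the fraction of ORF products whose best match score meets the threshold."""
    if not best_scores:
        return 0.0
    hits = sum(1 for score in best_scores.values() if score >= threshold)
    return hits / len(best_scores)


# Hypothetical toy data: ORF identifier -> best match score.
# Best score of each ORF product against other ORF products of the same genome.
within_genome = {"orf1": 250.0, "orf2": 40.0, "orf3": 120.0, "orf4": 15.0}
# Best score of each ORF product against the proteome of another genome.
between_genomes = {"orf1": 300.0, "orf2": 95.0, "orf3": 30.0, "orf4": 110.0}

# Hypothetical organism-specific significance threshold (assumed, not from the paper).
threshold = 90.0

duplicated = fraction_above_threshold(within_genome, threshold)
conserved = fraction_above_threshold(between_genomes, threshold)

print(f"non-unique ORF products: {duplicated:.0%}")
print(f"ORF products conserved in another genome: {conserved:.0%}")
```

In practice the threshold would differ from one organism to the next, since the abstract states that it depends on the mean ORF product size and the composition of the target proteome.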

[1]  André Goffeau,et al.  The yeast genome directory. , 1997, Nature.

[2]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[3]  S. Salzberg,et al.  Complete genome sequence of Treponema pallidum, the syphilis spirochete. , 1998, Science.

[4]  R. W. Davis,et al.  Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. , 1998, Science.

[5]  Mark Borodovsky,et al.  The complete genome sequence of the gastric pathogen Helicobacter pylori , 1997, Nature.

[6]  G. Church,et al.  Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional analysis and comparative genomics , 1997, Journal of bacteriology.

[7]  R. Fleischmann,et al.  Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. , 1995, Science.

[8]  R. Durbin,et al.  Analysis of protein domain families in Caenorhabditis elegans. , 1997, Genomics.

[9]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[10]  S. Salzberg,et al.  Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi , 1997, Nature.

[11]  M. Sternberg Protein Structure Prediction: A Practical Approach , 1997 .

[12]  Eugene W. Myers,et al.  Optimal alignments in linear space , 1988, Comput. Appl. Biosci..

[13]  B. Barrell,et al.  Life with 6000 Genes , 1996, Science.

[14]  C. Sensen,et al.  Complete DNA sequence of yeast chromosome XI , 1994, Nature.

[15]  R. Fleischmann,et al.  Complete Genome Sequence of the Methanogenic Archaeon, Methanococcus jannaschii , 1996, Science.

[16]  S. Altschul Amino acid substitution matrices from an information theoretic perspective , 1991, Journal of Molecular Biology.

[17]  John C. Wootton,et al.  Statistics of Local Complexity in Amino Acid Sequences and Sequence Databases , 1993, Comput. Chem..

[18]  N. W. Davis,et al.  The complete genome sequence of Escherichia coli K-12. , 1997, Science.

[19]  B. Barrell,et al.  Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence , 1998, Nature.

[20]  R. Fleischmann,et al.  The Minimal Gene Complement of Mycoplasma genitalium , 1995, Science.

[21]  Russell F. Doolittle,et al.  Microbial genomes opened up , 1998, Nature.

[22]  S. Karlin,et al.  Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[23]  H. Hilbert,et al.  Comparative analysis of the genomes of the bacteria Mycoplasma pneumoniae and Mycoplasma genitalium. , 1997, Nucleic acids research.

[24]  K. Novak The complete genome sequence… , 1998, Nature Medicine.

[25]  R. Fleischmann,et al.  The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus , 1997, Nature.

[26]  Y. Nakamura,et al.  Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions (supplement). , 1996, DNA research : an international journal for rapid publication of reports on genes and genomes.

[27]  A. Goffeau,et al.  The complete genome sequence of the Gram-positive bacterium Bacillus subtilis , 1997, Nature.

[28]  T. Sicheritz-Pontén,et al.  The genome sequence of Rickettsia prowazekii and the origin of mitochondria , 1998, Nature.

[29]  M. Riley,et al.  Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module. , 1997, Journal of molecular biology.

[30]  P. Green,et al.  Ancient conserved regions in new gene sequences and the protein databases. , 1993, Science.

[31]  Sayaka,et al.  Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. , 1996, DNA research : an international journal for rapid publication of reports on genes and genomes.

[32]  E V Koonin,et al.  Sequence similarity analysis of Escherichia coli proteins: functional and evolutionary implications. , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[33]  R. Huber,et al.  The complete genome of the hyperthermophilic bacterium Aquifex aeolicus , 1998, Nature.

[34]  Gregory D. Schuler,et al.  ESTablishing a human transcript map , 1995, Nature Genetics.

[35]  F. Robb,et al.  Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3. , 1998, DNA research : an international journal for rapid publication of reports on genes and genomes.

[36]  H. Hilbert,et al.  Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. , 1996, Nucleic acids research.