Methods in comparative genomics: genome correspondence, gene identification and motif discovery

In Kellis et al. (2003), we reported the genome sequences of S. paradoxus, S. mikatae and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genome-wide comparative analysis allowed the identification of functionally important sequences, both coding and non-coding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species, despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change. We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously annotated genes) and refining the gene structure of hundreds of genes. We present novel methods for the systematic de novo identification of regulatory motifs. The methods do not rely on previous knowledge of gene function and in that way differ from the current literature on computational motif discovery. Based on the genome-wide conservation patterns of known motifs, we developed three conservation criteria that we used to discover novel motifs. We used an enumeration approach to select strongly conserved motif cores, which we extended and collapsed into a small number of candidate regulatory motifs. These include most previously known regulatory motifs as well as several noteworthy novel motifs. The majority of discovered motifs are enriched in functionally related genes, allowing us to infer a candidate function for novel motifs. Our results demonstrate the power of comparative genomics to further our understanding of any species. Our methods are validated by the extensive experimental knowledge in yeast, and will be invaluable in the study of complex genomes such as the human genome.
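
The reading-frame idea mentioned above can be illustrated with a small Python sketch. It is a simplified stand-in and not the test used in the paper: the function name `frame_conserved_fraction`, the pairwise gapped-alignment input, and the scoring rule (fraction of aligned bases whose frame offset is zero modulo 3) are all assumptions made for illustration.

```python
def frame_conserved_fraction(ref_aln: str, other_aln: str) -> float:
    """Fraction of aligned bases at which the two sequences share a reading frame.

    Both arguments are rows of a pairwise alignment of equal length,
    with '-' marking gaps. Hypothetical helper, for illustration only.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("alignment rows must have equal length")

    ref_pos = 0    # ungapped bases consumed so far in the reference
    other_pos = 0  # ungapped bases consumed so far in the other species
    aligned = 0    # columns where both rows have a base
    in_frame = 0   # aligned columns where the frame offset is 0 mod 3

    for r, o in zip(ref_aln, other_aln):
        r_base = r != "-"
        o_base = o != "-"
        if r_base and o_base:
            aligned += 1
            if (ref_pos - other_pos) % 3 == 0:
                in_frame += 1
        ref_pos += r_base
        other_pos += o_base

    return in_frame / aligned if aligned else 0.0


# A 3-base gap keeps the frame; a 1-base gap shifts everything downstream.
same_frame = frame_conserved_fraction("ATGAAACCCGGG", "ATG---CCCGGG")
shifted = frame_conserved_fraction("ATGAAACCCGGG", "ATG-AACCCGGG")
```

Under this toy scoring, a gap whose length is a multiple of three leaves the score at its maximum, while a single-base gap puts every downstream base out of frame. A real coding region would be expected to tolerate the first kind of change far more often than the second.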
