Gene prediction and verification in a compact genome with numerous small introns.

The genomes of clusters of related eukaryotes are now being sequenced at an increasing rate, creating a need for accurate, low-cost annotation of exon-intron structures. In this paper, we demonstrate that reverse transcription-polymerase chain reaction (RT-PCR) and direct sequencing based on predicted gene structures satisfy this need, at least for single-celled eukaryotes. The TWINSCAN gene prediction algorithm was adapted for the fungal pathogen Cryptococcus neoformans by using a precise model of intron lengths in combination with ungapped alignments between the genome sequences of the two closely related Cryptococcus varieties. With this approach, approximately 60% of known genes were predicted exactly, at every coding base and splice site. When previously unannotated TWINSCAN predictions were tested by RT-PCR and direct sequencing, 75% of targets spanning two predicted introns were amplified and produced high-quality sequence. When targets spanning the complete predicted open reading frame were tested, 72% of them amplified and produced high-quality sequence. We conclude that sequencing a small number of expressed sequence tags (ESTs) to provide training data, running TWINSCAN on an entire genome, and then performing RT-PCR and direct sequencing on all of its predictions would be a cost-effective method for obtaining an experimentally verified genome annotation.
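
The exact-gene accuracy figure quoted above (the share of known genes predicted correctly at every coding base and splice site) can be illustrated with a minimal Python sketch. This is not the evaluation code used in the study. It assumes, for illustration only, that each gene is reduced to a list of coding-exon intervals on a single sequence and strand; the function name `exact_gene_fraction` and the toy coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a coding exon is a (start, end) pair, 1-based and inclusive,
# and all genes lie on one sequence and one strand.
Exon = Tuple[int, int]


def exact_gene_fraction(known: Dict[str, List[Exon]],
                        predicted: List[List[Exon]]) -> float:
    """Fraction of known genes reproduced exactly by some prediction.

    A gene counts as exact only if every coding exon boundary matches,
    which implies that every coding base and every splice site matches.
    """
    if not known:
        return 0.0
    # Normalise each prediction to a sorted tuple so it can be hashed.
    predicted_structures = {tuple(sorted(exons)) for exons in predicted}
    exact = sum(
        1 for exons in known.values()
        if tuple(sorted(exons)) in predicted_structures
    )
    return exact / len(known)


# Toy data (hypothetical coordinates, not taken from the paper).
known_genes = {
    "geneA": [(100, 200), (260, 400)],
    "geneB": [(1000, 1100)],
    "geneC": [(5, 50), (110, 150)],
}
predictions = [
    [(100, 200), (260, 400)],  # matches geneA exactly
    [(1000, 1105)],            # geneB: one boundary off by 5 bases
    [(5, 50), (110, 150)],     # matches geneC exactly
]

fraction = exact_gene_fraction(known_genes, predictions)
print(f"{fraction:.2f}")  # prints 0.67
```

The measure is deliberately strict: a prediction that misses a single boundary by one base, as in the second toy prediction, scores the same as a prediction that misses the gene entirely.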
