Analysis of EST-driven gene annotation in human genomic sequence.

We have performed a systematic analysis of gene identification in genomic sequence by similarity search against expressed sequence tags (ESTs) to assess the suitability of this method for automated annotation of the human genome. A BLAST-based strategy was constructed and applied to test sets containing all human genomic sequences longer than 5 kb in public databases, plus 300 kb of exhaustively characterized benchmark sequence. At high stringency, 70%-90% of all annotated genes were detected by near-identity to EST sequence, and >95% of ESTs aligning with well-annotated sequences overlapped a gene. These ESTs provide immediate access to the corresponding cDNA clones for follow-up laboratory verification and subsequent biological analysis. At lower stringency, up to 97% of annotated genes were identified by similarity to ESTs. At the lowest stringency, the apparent false-positive rate (the fraction of aligning ESTs that overlap no annotated gene) rose to 55% among all sequences but only 20% among benchmark sequences, indicating that many genes in public database entries are unannotated. Approximately half of the alignments spanned multiple exons, and thus aid in the construction of gene predictions and the elucidation of alternative splicing. In addition, ESTs from multiple cDNA libraries frequently clustered over genes, providing a starting point for crude expression profiles. Clone IDs may be used to form EST pairs, and in particular to extend gene models by associating lower-stringency alignments with high-quality alignments. These results demonstrate that EST similarity search is a practical general-purpose annotation technique that complements pattern recognition methods as a tool for gene characterization.
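The two headline quantities reported above, the fraction of annotated genes detected by EST alignments and the apparent false-positive rate, can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `annotation_metrics`, the interval representation, and the simple overlap criterion are assumptions made for illustration, and the coordinates in the usage example are invented.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) in inclusive genomic coordinates.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def annotation_metrics(genes: List[Interval],
                       est_hits: List[Interval]) -> Tuple[float, float]:
    """Compute (sensitivity, apparent false-positive rate).

    sensitivity: fraction of annotated genes overlapped by at least one EST hit.
    apparent false-positive rate: fraction of EST hits overlapping no annotated
    gene. It is "apparent" because such hits may mark unannotated genes.
    """
    detected = sum(1 for g in genes if any(overlaps(g, h) for h in est_hits))
    off_gene = sum(1 for h in est_hits
                   if not any(overlaps(h, g) for g in genes))
    sensitivity = detected / len(genes) if genes else 0.0
    apparent_fp = off_gene / len(est_hits) if est_hits else 0.0
    return sensitivity, apparent_fp


if __name__ == "__main__":
    # Invented toy data: three annotated genes, four EST alignments.
    genes = [(100, 500), (1000, 2000), (3000, 3500)]
    est_hits = [(120, 300), (450, 520), (1500, 1600), (2500, 2600)]
    sens, fp = annotation_metrics(genes, est_hits)
    print(f"sensitivity={sens:.2f} apparent_fp={fp:.2f}")
```

In this toy case two of three genes are hit and one of four EST alignments falls outside every annotation. A real analysis would also need an alignment stringency threshold and strand handling, which this sketch omits.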
