Assembly, annotation, and integration of UNIGENE clusters into the human genome draft.

The recent release of the first draft of the human genome provides an unprecedented opportunity to integrate human genes and their functions in a complete positional context. However, at least three significant technical hurdles remain: first, to assemble a complete and nonredundant human transcript index; second, to accurately place the individual transcript indices on the human genome; and third, to functionally annotate all human genes. Here, we report the extension of the UNIGENE database through the assembly of its sequence clusters into nonredundant sequence contigs. Each resulting consensus sequence was aligned to the human genome draft. A unique location for each transcript within the human genome was determined by integrating the restriction fingerprint, assembled genomic contig, and radiation hybrid (RH) maps. A total of 59,500 UNIGENE clusters were mapped on the basis of at least three independent criteria, compared with the 30,000 human genes/ESTs currently mapped in GeneMap'99. Finally, the extension of the human transcript consensus sequences in this study enabled more putative functional assignments than the 11,000 annotated entries in UNIGENE. This study reports a draft physical map with annotations for a majority of the human transcripts, called the Human Index of Nonredundant Transcripts (HINT). Such information can be applied immediately to the discovery of new genes and the identification of candidate genes for positional cloning.
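
The idea of assigning a transcript a unique location by integrating independent maps can be illustrated with a minimal sketch. This is not the procedure used in the study, whose details are not given here; it only shows one plausible agreement rule. The function name `consensus_location`, the source labels, the location strings, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def consensus_location(
    candidates: Dict[str, Optional[str]], min_support: int = 2
) -> Optional[str]:
    """Pick a single location supported by several independent maps.

    `candidates` maps a source label (hypothetical, e.g. "rh_map") to the
    location that source proposes, or None if the source has no placement.
    Returns the winning location, or None when support is too weak or tied.
    """
    # Count how many sources propose each location, ignoring missing data.
    votes = Counter(loc for loc in candidates.values() if loc is not None)
    if not votes:
        return None

    ranked = votes.most_common()
    best_loc, best_count = ranked[0]

    # Treat a tie for first place as ambiguous rather than picking arbitrarily.
    tied = len(ranked) > 1 and ranked[1][1] == best_count
    if best_count < min_support or tied:
        return None
    return best_loc


# Hypothetical evidence for one transcript consensus sequence.
evidence = {
    "fingerprint_map": "chr7:contig_12",
    "genomic_contig": "chr7:contig_12",
    "rh_map": "chr7:contig_12",
}
location = consensus_location(evidence, min_support=3)
```

Requiring agreement among sources, and returning no placement on a tie, reflects the general goal of reporting a location only when independent lines of evidence concur; the threshold itself is an assumption of this sketch.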
