Pieces of the puzzle: expressed sequence tags and the catalog of human genes

Imagine trying to solve a jigsaw puzzle without having all of the pieces. This is exactly the dilemma faced by researchers in molecular medicine when they attempt to understand how human genes and their protein products interact to produce normal biological functions, how these functions break down in various disease states, and how normal function can be restored through molecular intervention. This description of the Puzzle of Life is not meant to deny the importance of environmental and other epigenetic factors; it simply defines the boundaries of a puzzle whose solution is easily within our grasp. To further our basic understanding of human biology and the genetics of inherited diseases, it would be immensely valuable to compile a complete catalog of human gene sequences and to make this information available over the Internet to scientists around the world. Over the past few years, huge amounts of data relevant to this puzzle have become available, but solving the puzzle remains a bioinformatics challenge.

Before setting out to solve the Puzzle of Life, it would be useful to have a rough sense of how many pieces it contains. In other words, how many human genes are there? Based on indirect evidence, estimates ranging from approximately 64,000 [1] to 80,000 [2] genes have been advanced. Complete genomic sequencing has been used to generate gene catalogs for several organisms with relatively small genomes [3]. However, sequencing the human genome is a much more daunting task because of its immense size (about 3 billion bases). The United States Human Genome Project began in 1990 with the ambitious goal of sequencing the human genome within 15 years (i.e., by the year 2005) [4].
Unfortunately, only about 2% of the total bases make up the protein-coding portions of our genes; the remaining 98% is of unknown function and is often referred to as “junk DNA.” Thus, sequencing the genome may not be the most efficient way to generate a catalog of human genes. A number of investigators have advocated large-scale sequencing of the transcription products of genes, in the form of complementary DNA (cDNA) clones, as a prelude to sequencing the entire human genome. As Brenner [5] put it, “If something like 98% of the genome is junk, then the best strategy would be to find the important 2%, and sequence it first.”

An abundance of puzzle pieces

[1] L. Hood, et al. A common language for physical mapping of the human genome. 1989, Science.

[2] E. Myers, et al. Basic local alignment search tool. 1990, Journal of molecular biology.

[3] S. Brenner, et al. The human genome: the nature of the enterprise. 2007, Ciba Foundation symposium.

[4] J. Sikela, et al. Use of 3' untranslated sequences of human cDNAs for rapid chromosome assignment and conversion to STSs: implications for an expression map of the genome. 1991, Nucleic acids research.

[5] A. Kerlavage, et al. Complementary DNA sequencing: expressed sequence tags and human genome project. 1991, Science.

[6] N. Halloran, et al. A survey of expressed genes in Caenorhabditis elegans. 1992, Nature Genetics.

[7] M. Adams, et al. Caenorhabditis elegans expressed sequence tags identify gene families and potential disease gene homologues. 1992, Nature Genetics.

[8] K. Okubo, et al. cDNA analyses in the human genome project. 1993, Gene.

[9] M. Boguski, et al. dbEST — database for “expressed sequence tags”. 1993, Nature Genetics.

[10] A. Bird, et al. Number of CpG islands and genes in human and mouse. 1993, Proceedings of the National Academy of Sciences of the United States of America.

[11] C. Auffray, et al. Finding new genes faster than ever. 1993, Nature Genetics.

[12] J. Craig Venter, et al. 3,400 new expressed sequence tags identify diversity of transcripts in human brain. 1993, Nature Genetics.

[13] D. Galas, et al. A new five-year plan for the U.S. Human Genome Project. 1993, Science.

[14] M. S. Boguski, et al. Gene discovery in dbEST. 1994, Science.

[15] Frans, et al. Genes Galore: A Summary of Methods for Accessing Results from Large-Scale Partial Sequencing of Anonymous Arabidopsis cDNA Clones. 1994, Plant physiology.

[16] M. Adams, et al. How many genes in the human genome? 1994, Nature Genetics.

[17] M. Soares, et al. Construction and characterization of a normalized cDNA library. 1994, Proceedings of the National Academy of Sciences of the United States of America.

[18] Ronald W. Davis, et al. Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. 1995, Science.

[19] R. Fleischmann, et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. 1995, Nature.

[20] S. Bentolila, et al. The Genexpress Index: a resource for gene discovery and the genic map of the human genome. 1995, Genome research.

[21] Francis S. Collins, et al. Positional cloning moves from perditional to traditional. 1995, Nature Genetics.

[22] L. Kruglyak, et al. An STS-Based Map of the Human Genome. 1995, Science.

[23] M. Brennan, et al. COMMENTARY: So many needles, so much hay. 1995.

[24] Gregory D. Schuler, et al. ESTablishing a human transcript map. 1995, Nature Genetics.

[25] M. Seldin, et al. Human/mouse homology relationships. 1996, Genomics.

[26] L. Penland, et al. Use of a cDNA microarray to analyse gene expression patterns in human cancer. 1996, Nature Genetics.

[27] M. Soares, et al. Normalization and subtraction: two approaches to facilitate gene discovery. 1996, Genome research.

[28] H. Friess, et al. A pancreatic cancer-specific expression profile. 1996, Oncogene.

[29] E. Mardis, et al. Generation and analysis of 280,000 human expressed sequence tags. 1996, Genome research.

[30] M. Boguski, et al. Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. 1996, Genome research.

[31] C. Auffray, et al. The I.M.A.G.E. Consortium: an integrated molecular analysis of genomes and their expression. 1996, Genomics.

[32] D. Lockhart, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. 1996, Nature Biotechnology.

[33] P. Deloukas, et al. A Gene Map of the Human Genome. 1996, Science.

[34] L. Liotta, et al. Laser capture microdissection. 2006, Methods in molecular biology.

[35] Carol A. Dahl, et al. New opportunities for uncovering the molecular basis of cancer. 1997, Nature Genetics.

[36] E. Koonin. Big time for small genomes. 1997, Genome research.

[37] R. H. Hruban, et al. Gene expression profiles in normal and cancer cells. 1997, Science.

[38] R. W. Davis, et al. Discovery and analysis of inflammatory disease-related genes using cDNA microarrays. 1997, Proceedings of the National Academy of Sciences of the United States of America.