Targeted discovery of novel human exons by comparative genomics.

A complete and accurate set of human protein-coding gene annotations is perhaps the single most important resource for genomic research after the human genome sequence itself, yet the major gene catalogs remain incomplete and imperfect. Here we describe a genome-wide effort, carried out as part of the Mammalian Gene Collection (MGC) project, to identify human genes not yet in the gene catalogs. Our approach was to produce gene predictions by algorithms that rely on comparative sequence data but do not require direct cDNA evidence, then to test predicted novel genes by RT-PCR. We have identified 734 novel gene fragments (NGFs) containing 2188 exons with, at most, weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which >160 are completely absent from the major gene catalogs, while hundreds of others represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner, and they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development. Our results demonstrate that many important genes and gene fragments have been missed by traditional approaches to gene discovery but can be identified by their evolutionary signatures using comparative sequence data. However, they also suggest that hundreds, not thousands, of protein-coding genes are completely missing from the current gene catalogs.
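The step of separating predicted exons that already have catalog or cDNA support from those that are candidates for RT-PCR testing can be pictured as an interval-overlap filter. The following is a minimal sketch under stated assumptions, not the procedure used in the study: the `Exon` type, the `novel_exons` function, the half-open coordinate convention, and the toy coordinates are all hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Exon(NamedTuple):
    """Hypothetical exon record; coordinates are 0-based, half-open [start, end)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Exon, b: Exon) -> bool:
    """Return True if two exons share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def novel_exons(predicted: List[Exon], known: List[Exon]) -> List[Exon]:
    """Keep predicted exons that overlap no known (annotated or cDNA-supported) exon.

    This is a simple quadratic scan, adequate for a toy example; a real
    genome-wide comparison would use sorted intervals or an interval tree.
    """
    return [p for p in predicted if not any(overlaps(p, k) for k in known)]


# Toy data (invented coordinates, for illustration only).
predicted = [
    Exon("chr1", 100, 200),  # overlaps a known exon, so it is not novel
    Exon("chr1", 500, 650),  # no overlap, so it is a candidate for testing
    Exon("chr2", 100, 200),  # different chromosome, so it is a candidate
]
known = [Exon("chr1", 150, 250)]

candidates = novel_exons(predicted, known)
for exon in candidates:
    print(exon.chrom, exon.start, exon.end)
```

In this sketch, any overlap with a known exon disqualifies a prediction. The abstract's criterion of "at most, weak prior cDNA support" is looser than that, so a closer approximation would need some scoring of the supporting evidence rather than a yes/no overlap test.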
