A Guide to the Mammalian Genome

Sequencing of a Transcriptome

The rapid completion and public release of the mouse and human genome sequences has led to a downward revision of the number of “genes” predicted in the mammalian genome, to around 30,000 (Mouse Genome Sequencing Consortium, Waterston et al. 2002). In simpler organisms such as yeast, estimating gene number is comparatively straightforward, because the majority of the genome clearly encodes proteins, and individual genes generally have a well-defined start and finish and a single mRNA output. In mammals, the task is much more complex. Only a small proportion of the genome encodes mRNAs that in turn encode protein, and protein-coding sequence is interspersed with large introns or intergenic regions. Even protein-coding genes have proven difficult to annotate reliably (Kawai et al. 2001), and non-protein-coding genes are essentially impossible to annotate a priori. The key to reliable annotation of a mammalian genome is the comprehensive characterization of its transcriptional output, the transcriptome.

There are two approaches to this problem. The most common is high-throughput sequencing of cDNA ends (ESTs). In mouse and human, and to a lesser extent in many other mammals, there are millions of EST sequences in various repositories. EST sequences can be computationally assembled into clusters, as in the UniGene projects (http://www.ncbi.nlm.nih.gov/UniGene). This approach has many drawbacks, from the perspectives of cDNA cloning, sequence quality, and computation, but the most compelling is that the assembled sequences are generated in silico and are not necessarily supported by a physical clone. It is also rather inefficient: even with the best subtraction and normalization, abundant transcripts have been sequenced thousands of times, whereas many rare transcripts are absent from EST databases.
EST assemblies are particularly difficult to interpret when there are multigene families or complex alternative splicing.

The alternative approach is to systematically isolate and sequence full-length cDNAs. The logistics of this approach are daunting, and it is actually far more challenging than genomic sequencing, especially shotgun sequencing, because of the difficulty of collecting the samples. Nevertheless, the RIKEN Mouse Gene Encyclopedia Project has taken this approach. In the process, the RIKEN team has provided a model for eukaryotic transcriptome projects.

The task required a range of new technologies and approaches. In outline, the RIKEN team developed new approaches to the production of full-length cDNAs (Carninci et al. 2003) that required (1) a novel reverse transcriptase reaction (to enable effective complete first-strand synthesis), (2) novel 5′-end capture technology, and (3) novel approaches to normalization and subtraction of cDNA libraries. Starting with their first libraries, the RIKEN team sequenced 3′ ends (and later 5′ ends) in a Phase 1 sequencing pipeline and, for each individual clone, determined whether it had been sequenced previously or could be ascribed to a new cluster. In the second phase, individual representatives of EST clusters were selected and fully sequenced to produce full-length cDNA sequences, each representing an individual physical clone. At a number of stages in the project, the RIKEN team assembled a set of cDNAs that had previously been sequenced and used them to subtract successive libraries. The success of the approach is outlined in detail in Carninci et al. (2003). The output of this pipeline was analyzed in the FANTOM2 meeting (April 29 to May 5, 2002, Yokohama, Japan), which is the basis of this special issue of Genome Research.
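The two-phase logic described above (assign each end read to an existing cluster or open a new one, then pick one physical clone per cluster for full sequencing) can be illustrated with a toy sketch. This is not the RIKEN pipeline: the shared k-mer similarity measure, the 0.5 threshold, the clone names, and the sequences are all assumptions made for illustration, and a real project would rely on alignment-based clustering.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def kmers(seq: str, k: int = 8) -> set[str]:
    """Return the set of k-mers found in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: set[str], b: set[str]) -> float:
    """Fraction of the smaller k-mer set that is shared with the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


@dataclass
class Cluster:
    representative: str              # clone ID proposed for full sequencing
    kmer_set: set[str]               # k-mers of the representative's end read
    members: list[str] = field(default_factory=list)


def assign_clones(end_reads: dict[str, str],
                  threshold: float = 0.5,
                  k: int = 8) -> list[Cluster]:
    """Phase 1 (toy version): place each clone's end read into the first
    cluster it resembles, or start a new cluster if none matches.
    The first clone seen in a cluster becomes its representative."""
    clusters: list[Cluster] = []
    for clone_id, read in end_reads.items():
        read_kmers = kmers(read, k)
        for cluster in clusters:
            if similarity(read_kmers, cluster.kmer_set) >= threshold:
                cluster.members.append(clone_id)   # already represented
                break
        else:
            clusters.append(Cluster(clone_id, read_kmers, [clone_id]))
    return clusters


# Hypothetical end reads: cloneB overlaps cloneA, cloneC is unrelated.
reads = {
    "cloneA": "ATGGCGTACGTTAGCCTAGGATCCGATTACG",
    "cloneB": "GGCGTACGTTAGCCTAGGATCCGATTACGTT",
    "cloneC": "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG",
}
clusters = assign_clones(reads)

# Phase 2 (toy version): one representative per cluster goes to full sequencing.
to_sequence = [c.representative for c in clusters]
```

In this sketch the redundant clone (`cloneB`) is recognized at the end-read stage and never sent for full sequencing, which is the efficiency gain the two-phase design aims for.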

[1]  Diego G. Silva,et al.  Identification of novel "pathologs" (human disease-related gene candidates) from the RIKEN full-length mouse cDNA data set. , 2003.

[2]  Y. Hayashizaki,et al.  Systematic expression profiling of the mouse transcriptome using RIKEN cDNA microarrays. , 2003, Genome research.

[3]  Martin Ringwald,et al.  Connecting sequence and biology in the laboratory mouse. , 2003, Genome research.

[4]  Yoshihide Hayashizaki,et al.  The mammalian protein-protein interaction database and its viewing system that is linked to the main FANTOM2 viewer. , 2003, Genome research.

[5]  D. Hill,et al.  G protein-coupled receptor genes in the FANTOM2 database. , 2003, Genome research.

[6]  Mitsutoshi Setou,et al.  Kinesin superfamily proteins (KIFs) in the mouse transcriptome. , 2003, Genome research.

[7]  Diego G. Silva,et al.  Inferring higher functional information for RIKEN mouse full-length cDNA clones with FACTS. , 2003, Genome research.

[8]  Terry Gaasterland,et al.  Systematic characterization of the zinc-finger-containing proteins in the mouse transcriptome. , 2003, Genome research.

[9]  Y. Hayashizaki,et al.  Comprehensive analysis of the mouse metabolome based on the transcriptome. , 2003, Genome research.

[10]  Yoshihide Hayashizaki,et al.  Discovery of imprinted transcripts in the mouse transcriptome using large-scale expression profiling. , 2003, Genome research.

[11]  Paul Denny,et al.  A comprehensive transcript map of the mouse Gnas imprinted complex. , 2003, Genome research.

[12]  Yoshihide Hayashizaki,et al.  CDS annotation in full-length cDNA sequence. , 2003, Genome research.

[13]  J. Blake,et al.  Human disease genes and their cloned mouse orthologs: exploration of the FANTOM2 cDNA sequence data set. , 2003, Genome research.

[14]  Yoshihide Hayashizaki,et al.  Antisense transcripts with FANTOM2 clone set and their implications for gene regulation. , 2003, Genome research.

[15]  S. Grimmond,et al.  Exploration of the cell-cycle genes found within the RIKEN FANTOM2 data set. , 2003, Genome research.

[16]  Hideo Matsuda,et al.  Development and evaluation of an automated annotation pipeline and cDNA annotation system. , 2003, Genome research.

[17]  M. Fagiolini,et al.  Targeting a complex transcriptome: the construction of the mouse full-length cDNA encyclopedia. , 2003, Genome research.

[18]  Diego G. Silva,et al.  Cytokine-related genes identified from the RIKEN full-length mouse cDNA data set. , 2003, Genome research.

[19]  Zheng Yuan,et al.  The mouse secretome: functional classification of the proteins secreted into the extracellular environment. , 2003, Genome research.

[20]  J. Mattick,et al.  Identification and analysis of chromodomain-containing proteins encoded in the mouse transcriptome. , 2022.

[21]  M. Tomita,et al.  Identification of putative noncoding RNAs among the RIKEN mouse full-length cDNA collection. , 2003, Genome research.

[22]  Thomas Huber,et al.  Phosphoregulators: protein kinases and protein phosphatases of mouse. , 2003, Genome research.

[23]  C. Semple,et al.  The comparative proteomics of ubiquitination in mouse. , 2003, Genome research.

[24]  Terry Gaasterland,et al.  Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. , 2003, Genome research.

[25]  John Quackenbush,et al.  Continued discovery of transcriptional units expressed in cells of the mouse mononuclear phagocyte lineage. , 2003, Genome research.

[26]  Ruchi M. Newman,et al.  Comparative analysis of apoptosis and inflammation genes of mice and humans. , 2003, Genome research.

[27]  Melissa J. Davis,et al.  Mouse proteome analysis. , 2003, Genome research.

[28]  Colin N. Dewey,et al.  Initial sequencing and comparative analysis of the mouse genome. , 2002.

[29]  E. Birney,et al.  Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs , 2002, Nature.

[30]  C. Bult,et al.  Functional annotation of a full-length mouse cDNA collection , 2001, Nature.

[31]  E Pennisi,et al.  Ideas Fly at Gene-Finding Jamboree , 2000, Science.

[32]  Olive Lloyd-Baker IDENTIFICATION OF NOVEL , 1964 .