Complete sequencing and characterization of 21,243 full-length human cDNAs

As a base for human transcriptome and functional genomics, we created the “full-length long Japan” (FLJ) collection of sequenced human cDNAs. We determined the entire sequence of 21,243 selected clones and found that 14,490 cDNAs (10,897 clusters) were unique to the FLJ collection. About half of them (5,416) seemed to be protein-coding. Of those, 1,999 clusters had not been predicted by computational methods. The distribution of GC content of nonpredicted cDNAs had a peak at ∼58% compared with a peak at ∼42% for predicted cDNAs. Thus, there seems to be a slight bias against GC-rich transcripts in current gene prediction procedures. The rest of the cDNAs unique to the FLJ collection (5,481) contained no obvious open reading frames (ORFs) and thus are candidate noncoding RNAs. About one-fourth of them (1,378) showed a clear pattern of splicing. The distribution of GC content of noncoding cDNAs was narrow and had a peak at ∼42%, relatively low compared with that of protein-coding cDNAs.
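The two quantities the abstract relies on, GC content and the presence of an open reading frame, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the forward-strand-only scan, and the `min_codons=100` cutoff are all assumptions made for illustration, since the abstract does not state the criteria used. The final check only confirms that the two reported categories (5,416 and 5,481) add up to the 10,897 unique clusters.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def longest_orf_codons(seq: str) -> int:
    """Longest forward-strand ORF, in codons (ATG counted, stop codon not)."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


def classify(seq: str, min_codons: int = 100) -> str:
    """Label a cDNA by ORF length; the cutoff is a hypothetical choice."""
    if longest_orf_codons(seq) >= min_codons:
        return "protein-coding candidate"
    return "noncoding candidate"


# Counts taken from the abstract: both categories sum to the cluster total.
UNIQUE_CLUSTERS = 10_897
CODING, NONCODING = 5_416, 5_481
assert CODING + NONCODING == UNIQUE_CLUSTERS
```

For example, `gc_content("GGCCAATT")` evaluates to 0.5, and `classify` applied to a short sequence with no long ORF returns `"noncoding candidate"`. A real analysis would also scan the reverse strand and typically use a dedicated ORF finder.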

N. Nomura | Y. Masuho | H. Hishigaki | K. Nakai | A. Tanigami | T. Ishibashi | Toshihiro Tanaka | M. Sekine | S. Sugano | Kouichi Kimura | A. Wakamatsu | Yutaka Suzuki | T. Ota | T. Nishikawa | R. Yamashita | J. Yamamoto | S. Ishii | T. Sugiyama | Kaoru Saito | Yuko Isono | R. Irie | K. Kanda | Takahide Yokoi | H. Kondo | M. Wagatsuma | K. Murakawa | Shinichi Ishida | A. Takahashi-Fujii | Tomo-o Tanase | K. Nagai | H. Kikuchi | T. Isogai | T. Otsuki | K. Hayashi | Hiroyuki Sato | H. Makita | Masaya Obayashi | T. Nishi | T. Shibahara | Y. Kawai | Y. Nakamura | K. Nagahari | K. Murakami | Tomohiro Yasuda | T. Iwayanagi | A. Shiratori | Hiroaki Sudo | Takehiko Hosoiri | Yoshiko Kaku | H. Kodaira | M. Sugawara | Makiko Takahashi | T. Furuya | E. Kikkawa | Yuhi Omura | Kumiko Abe | K. Kamihara | N. Katsuta | Kazuo Sato | M. Tanikawa | M. Yamazaki | Kenji Ninomiya | Hiromichi Yamashita | K. Fujimori | H. Tanai | M. Kimata | Motoji Watanabe | S. Hiraoka | Yoshiyuki Chiba | Yukio Ono | Sumiyo Takiguchi | Susumu Watanabe | Makoto Yosida | T. Hotuta | Junko Kusano | K. Kanehori | Hiroto Hara | Y. Nomura | Sakae Togiya | Fukuyo Komai | R. Hara | K. Takeuchi | M. Arita | Nobuyuki Imose | Kaoru Musashino | Hisatsugu Yuuki | Atsushi Oshima | Naokazu Sasaki | S. Aotsuka | Y. Yoshikawa | Hiroshi Matsunawa | Tatsuo Ichihara | N. Shiohata | Sanae Sano | S. Moriya | H. Momiyama | N. Satoh | S. Takami | Y. Terashima | Osamu Suzuki | Satoshi Nakagawa | A. Senoh | H. Mizoguchi | Yoshihiro Goto | F. Shimizu | H. Wakebe | Takeshi K. Watanabe | A. Sugiyama | M. Takemoto | B. Kawakami | M. Yamazaki | Koji Watanabe | Ayako Kumagai | S. Itakura | Y. Fukuzumi | Y. Fujimori | M. Komiyama | H. Tashiro | T. Fujiwara | T. Ono | K. Yamada | Y. Fujii | K. Ozaki | M. Hirao | Y. Ohmori | A. Kawabata | T. Hikiji | Naoko Kobatake | Hiromichi Inagaki | Y. Ikema | Sachiko Okamoto | Rie Okitani | Takuma Kawakami | S. Noguchi | T. Itoh | Keiko Shigeta | Tadashi Senba | K. Matsumura | Y. Nakajima | T. Mizuno | M. Morinaga | M. Sasaki | T. Togashi | M. Oyama | H. Hata | Manabu Watanabe | T. Komatsu | J. Mizushima-Sugano | T. Satoh | Yuko Shirai | Yukiko Y. Takahashi | Kiyomi Nakagawa | K. Okumura | T. Nagase | T. Yada | Yusuke Nakamura | O. Ohara | M. Watanabe | Manabu Watanabe | Keiichi Nagai | O. Suzuki | Toshikazu Shibahara | Maasa Hirao | Takami Komatsu | Tetsuji Otsuki | Ai Wakamatsu
