A Biomedically Enriched Collection of 7000 Human ORF Clones

We report the production and availability of more than 7000 fully sequence-verified plasmid ORF clones representing over 3400 unique human genes. These ORF clones were derived using the human Mammalian Gene Collection (MGC) as template and were produced in two formats: with and without stop codons. The collection therefore supports the production of either native proteins or proteins with fusion tags added to either or both ends. The template clones used to generate this collection were enriched in three ways. First, gene redundancy was removed. Second, clones were selected to represent the best available GenBank reference sequence. Finally, a literature-based software tool was used to evaluate the list of target genes to ensure that it broadly reflected biomedical research interests. The target gene list was compared with 4000 human diseases and over 8500 biological and chemical MeSH classes in ∼15 million publications recorded in PubMed at the time of analysis. This analysis showed that, relative to both the genome and the MGC collection, this collection is enriched for genes with published associations with a wide range of diseases and biomedical terms, without a particular bias towards any single disease or concept. The collection is therefore likely to be a powerful resource for researchers who wish to study protein function in a set of genes with documented biomedical significance.
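The enrichment comparison described above can be illustrated with a minimal sketch. It is not the literature-mining tool used for this collection: the gene names, disease labels, and the two summary measures (fraction of genes with at least one published association, and the share of associations held by the most common disease) are hypothetical stand-ins chosen only to show the kind of comparison involved.

```python
from collections import Counter

def association_summary(genes, gene_to_diseases):
    """Return (fraction of genes with >= 1 disease link,
    share of all gene-disease links held by the most common disease)."""
    genes = set(genes)
    linked = [g for g in genes if gene_to_diseases.get(g)]
    fraction = len(linked) / len(genes) if genes else 0.0
    # Count each disease once per gene that is linked to it.
    counts = Counter(d for g in linked for d in set(gene_to_diseases[g]))
    total = sum(counts.values())
    top_share = max(counts.values()) / total if total else 0.0
    return fraction, top_share

def fold_enrichment(collection, background, gene_to_diseases):
    """Ratio of linked-gene fractions: collection versus background."""
    coll_frac, _ = association_summary(collection, gene_to_diseases)
    bg_frac, _ = association_summary(background, gene_to_diseases)
    return coll_frac / bg_frac if bg_frac else float("inf")

# Toy data (hypothetical): published gene-disease associations.
gene_to_diseases = {
    "GENE_A": ["disease_1"],
    "GENE_B": ["disease_2"],
    "GENE_C": [],
    "GENE_D": ["disease_1", "disease_3"],
}
background = ["GENE_A", "GENE_B", "GENE_C", "GENE_D",
              "GENE_E", "GENE_F", "GENE_G", "GENE_H"]
collection = ["GENE_A", "GENE_B", "GENE_D", "GENE_E"]

enrichment = fold_enrichment(collection, background, gene_to_diseases)
fraction, top_share = association_summary(collection, gene_to_diseases)
print(enrichment, fraction, top_share)
```

A fold enrichment above 1 with a low top-disease share would correspond to the pattern reported here: more literature-linked genes than the background, spread across many diseases rather than concentrated in one.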
