Analysis of human mRNAs with the reference genome sequence reveals potential errors, polymorphisms, and RNA editing.

The NCBI Reference Sequence (RefSeq) project and the NIH Mammalian Gene Collection (MGC) together define a set of approximately 30,000 nonredundant human mRNA sequences with identified coding regions, representing 17,000 distinct loci. These high-quality mRNA sequences allow transcribed regions to be identified in the human genome sequence, and many researchers accept them as the correct representation of each defined gene sequence. Computational comparison of these mRNA sequences with the recently published, essentially finished human genome sequence reveals several thousand undocumented nonsynonymous substitution and frameshift discrepancies between the two resources. Additional analysis verifies that the euchromatic human genome sequence is sufficiently complete, containing nearly the whole mRNA collection, to allow a comprehensive comparison. Many of the discrepancies will prove to be genuine polymorphisms in the human population, somatic cell genomic variants, or examples of RNA editing. However, the genome sequence variant has significant additional support from other mRNAs and ESTs almost four times more often than the mRNA variant does, suggesting that the genome sequence is the more accurate of the two. In approximately 15% of these cases, there is substantial support for both variants, which suggests an undocumented polymorphism. An initial screening against a 24-individual genomic DNA diversity panel verified 60% of a small set of potential single nucleotide polymorphisms for which results could be obtained. We also find statistical evidence that a few of these discrepancies are due to RNA editing. Overall, these results suggest that the mRNA collections may contain a substantial number of errors. For current and future mRNA collections, it may be prudent to fully reconcile each discrepancy with the genome sequence, classifying it as a polymorphism, a site of RNA editing or somatic cell variation, or a genome sequence error.
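
The support comparison described above can be illustrated with a minimal sketch. It assumes that, for each discrepancy, the number of independent transcripts (other mRNAs and ESTs) agreeing with each variant has already been counted. The `Discrepancy` type, the `classify` function, the category labels, and the `min_support` threshold are all hypothetical names and values chosen for illustration; they are not the criteria used in the analysis itself.

```python
from dataclasses import dataclass


@dataclass
class Discrepancy:
    """Transcript support for the two variants at one discrepant site."""
    genome_support: int  # other mRNAs/ESTs matching the genome variant
    mrna_support: int    # other mRNAs/ESTs matching the mRNA variant


def classify(d: Discrepancy, min_support: int = 2) -> str:
    """Label a discrepancy by which variant has independent support.

    min_support is an assumed cutoff for "substantial" support.
    """
    genome_ok = d.genome_support >= min_support
    mrna_ok = d.mrna_support >= min_support
    if genome_ok and mrna_ok:
        # Both variants are seen repeatedly: possible polymorphism.
        return "candidate polymorphism"
    if genome_ok:
        return "genome variant supported"
    if mrna_ok:
        return "mRNA variant supported"
    return "unresolved"


if __name__ == "__main__":
    for d in (Discrepancy(5, 0), Discrepancy(3, 4), Discrepancy(1, 6)):
        print(d, "->", classify(d))
```

A site that falls in the "mRNA variant supported" category would still need further evidence, such as genotyping against a diversity panel, to separate a genome sequence error from RNA editing or somatic variation.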
