Discrepancies between human DNA, mRNA and protein reference sequences and their relation to single nucleotide variants in the human population

The protein-coding sequences in the human reference genome GRCh38, the RefSeq mRNA database and the UniProt protein database are sometimes inconsistent with each other because of polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we comprehensively listed the discordant bases and regions between the GRCh38, RefSeq and UniProt reference sequences, based on the genomic coordinates of GRCh38. By assigning alternative allele frequencies to the discordant bases, we observed that the RefSeq sequences are more likely to represent the major alleles than the GRCh38 and UniProt sequences. Since some reference sequences carry minor alleles, functional and structural annotations may be based on alleles that are rare in the human population, thereby biasing these analyses. Some of the differences between RefSeq and GRCh38 reflect biological differences at known RNA-editing sites. The definitions of the coding regions are frequently complicated by possible micro-exons within introns and by SNVs with large alternative allele frequencies near exon–intron boundaries. The mRNA or protein regions missing from GRCh38 were mainly due to small deletions, and these sequences need to be identified. Taken together, our results clarify the overall consistency of the reference sequences and the inconsistencies that remain between them.
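The comparison described above can be illustrated with a minimal sketch. It is not the procedure used in the study: the function names are hypothetical, the two coding sequences are assumed to be already aligned, gap-free and of equal length, and the alternative allele frequency is assumed to be supplied by the caller from a population variant resource.

```python
from typing import List, Tuple


def discordant_positions(genome_cds: str, mrna_cds: str) -> List[Tuple[int, str, str]]:
    """Return (0-based offset, genome base, mRNA base) for every mismatch.

    Assumption: both sequences are aligned, gap-free and of equal length.
    """
    if len(genome_cds) != len(mrna_cds):
        raise ValueError("sequences must have equal length")
    return [
        (i, g, m)
        for i, (g, m) in enumerate(zip(genome_cds, mrna_cds))
        if g != m
    ]


def genome_carries_minor_allele(alt_allele_freq: float) -> bool:
    """The genome base is the reference allele by definition, so it is the
    minor allele whenever the alternative allele frequency exceeds 0.5."""
    return alt_allele_freq > 0.5


# Hypothetical toy sequences: one discordant base at offset 3.
for offset, genome_base, mrna_base in discordant_positions("ATGGCC", "ATGACC"):
    print(offset, genome_base, mrna_base)
```

A discordant base whose alternative allele frequency is above 0.5 is a position where the genome reference carries the minor allele, which is the situation the abstract flags as a potential source of bias in annotation.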
