Origins and characterization of variants shared between databases of somatic and germline human mutations

Background: Mutations arise in the human genome in two major settings: the germline and the soma. These settings involve different inheritance patterns, time scales, chromatin structures, and environmental exposures, all of which affect the resulting distribution of substitutions. Nonetheless, many of the same single nucleotide variants (SNVs) are shared between germline and somatic mutation databases, such as between the gnomAD database of 120,000 germline exomes and the TCGA database of 10,000 somatic exomes. Here, we sought to explain this overlap.

Results: After strict filtering to exclude common germline polymorphisms and sites with poor coverage or mappability, we found 336,987 variants shared between the somatic and germline databases. A uniform statistical model explains 34% of these shared variants; a model that incorporates the varying mutation rates of the basic mutation types explains another 50%; and a model that includes extended nucleotide contexts (e.g., the surrounding 3 bases on either side) explains an additional 4%. Analysis of read depth provides mixed evidence that up to 4% of the shared variants may represent germline variants leaked into somatic call sets. A further 9% of the shared variants are not explained by any of these models, and neither sequencing errors nor convergent evolution accounted for them. We also surveyed other factors: cancers driven by endogenous mutational processes share a greater fraction of variants with the germline, and recently derived germline variants were more likely to be somatically shared than ancient ones.

Conclusions: Overall, we find that shared variants largely represent bona fide biological occurrences of the same variant in the germline and somatic settings, and that they arise primarily because DNA has some of the same basic chemical vulnerabilities in either setting. Moreover, we find mixed evidence that somatic call sets leak appreciable numbers of germline variants, which is relevant to genomic privacy regulations. In future studies, the similar chemical vulnerability of DNA in the somatic and germline settings might help identify disease-related genes by guiding the development of background-mutation models informed by both somatic and germline patterns of variation.
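The step from a uniform model to one stratified by mutation type can be illustrated with a small calculation. The sketch below is not the authors' method: it assumes that each database is an independent random draw from the set of possible variants (site and alternate allele), so that the expected overlap is the hypergeometric mean. The function names and all counts are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def expected_shared_uniform(n_possible, n_germline, n_somatic):
    """Expected overlap of two independent uniform draws from n_possible variants.

    This is the hypergeometric mean: n_germline * n_somatic / n_possible.
    """
    return Fraction(n_germline * n_somatic, n_possible)


def expected_shared_stratified(classes):
    """Expected overlap when draws are uniform only within each mutation class.

    classes: iterable of (n_possible, n_germline, n_somatic) tuples, one per
    hypothetical mutation class (for example, CpG transitions vs. all others).
    """
    return sum(Fraction(g * s, p) for p, g, s in classes)


# Hypothetical toy counts, NOT taken from the paper:
# 1,000 possible variants, 100 seen in the germline set, 50 in the somatic set.
uniform = expected_shared_uniform(1000, 100, 50)

# Same totals, but a small hypermutable class (100 possible variants)
# receives most of the mutations in both sets.
stratified = expected_shared_stratified([
    (100, 60, 30),  # hypermutable class
    (900, 40, 20),  # all remaining variants
])

print(float(uniform), float(stratified))
```

With identical totals, the stratified expectation exceeds the uniform one whenever both sets are enriched in the same class, which is the intuition behind a type-aware model accounting for more of the overlap than a uniform one.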
