Accurate detection of mosaic variants in sequencing data without matched controls

Detection of mosaic mutations that arise in normal development is challenging, as such mutations are typically present in only a minute fraction of cells and there is no clear matched control for removing germline variants and systematic artifacts. We present MosaicForecast, a machine-learning method that leverages read-based phasing and read-level features to accurately detect mosaic single-nucleotide variants and indels, achieving a multifold increase in specificity compared with existing algorithms. Using single-cell sequencing and targeted sequencing, we validated 80–90% of the mosaic single-nucleotide variants and 60–80% of the indels detected in human brain whole-genome sequencing data. Our method should help elucidate the contribution of mosaic somatic mutations to the origin and development of disease.
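
The idea behind read-based phasing can be illustrated with a toy sketch. It is not the MosaicForecast implementation; the function names, the `min_reads` threshold, and the class labels are hypothetical and chosen only for illustration. The assumption is that each read spanning both a nearby germline heterozygous SNP and the candidate site is reduced to a pair of observed alleles. A germline heterozygous candidate should then produce two distinct haplotypes, whereas a mosaic candidate, present on only some copies of one parental haplotype, should produce three.

```python
from collections import Counter


def count_haplotypes(read_pairs, min_reads=2):
    """Count distinct (germline_allele, candidate_allele) combinations.

    read_pairs: iterable of tuples, one per read spanning both sites.
    min_reads: hypothetical threshold; combinations seen in fewer reads
               are treated as sequencing noise and ignored.
    """
    counts = Counter(read_pairs)
    return sum(1 for n in counts.values() if n >= min_reads)


def classify_by_phasing(read_pairs, min_reads=2):
    """Toy classification based only on the number of supported haplotypes."""
    n = count_haplotypes(read_pairs, min_reads)
    if n == 2:
        return "germline-like"   # candidate allele always travels with one SNP allele
    if n == 3:
        return "mosaic-like"     # one parental haplotype is split into ref and alt
    return "ambiguous"           # too few or too many haplotypes to interpret


# Usage: the candidate "alt" allele appears on only some reads carrying SNP allele "G".
reads = [("A", "ref")] * 10 + [("G", "ref")] * 7 + [("G", "alt")] * 3
label = classify_by_phasing(reads)
```

A phasing signal of this kind would be only one input among many; the abstract describes combining it with read-level features in a trained model, which this sketch does not attempt to reproduce.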
