Precise detection of de novo single nucleotide variants in human genomes

Significance The precise location of variants in the human genome is of utmost importance. We present a unique approach, coverage-based single nucleotide variant (SNV) identification (COBASI), which uses only perfect matches between the reads of a sequencing project and a reference genome to detect and accurately identify de novo SNVs. From the perfect matches, a representation of the read coverage per nucleotide along the genome, the variation landscape, is generated. SNVs are then pinpointed as significant changes in coverage, and de novo SNVs can be identified with high precision. The performance of COBASI was analyzed using simulations and experimentally validated by sequencing de novo SNVs identified from a parent–offspring trio. We propose this pipeline as a useful tool for different genomic applications.

The precise determination of de novo genetic variants has enormous implications across different fields of biology and medicine, particularly personalized medicine. Currently, de novo variations are identified by mapping sample reads from a parent–offspring trio to a reference genome while tolerating a certain number of differences. While widely used, this approach often introduces false-positive (FP) results due to misaligned reads and mischaracterized sequencing errors. In a previous study, we developed an alternative approach to accurately identify single nucleotide variants (SNVs) using only perfect matches. However, this approach could be applied only to haploid regions of the genome and was computationally intensive. In this study, we present a unique approach, coverage-based single nucleotide variant identification (COBASI), which allows the exploration of the entire genome using second-generation short sequence reads without extensive computing requirements. COBASI identifies SNVs using changes in coverage of exactly matching unique substrings, and is particularly suited for pinpointing de novo SNVs. Unlike other approaches that require population frequencies across hundreds of samples to filter out any methodological biases, COBASI can be applied to detect de novo SNVs within isolated families. We demonstrate this capability through extensive simulation studies and by studying a parent–offspring trio we sequenced using short reads. Experimental validation of all 58 candidate de novo SNVs and a selection of non-de novo SNVs found in the trio confirmed zero FP calls. COBASI is available as open source at https://github.com/Laura-Gomez/COBASI for any researcher to use.
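
The idea of reading SNVs off a coverage landscape built from exact matches can be illustrated with a toy sketch. This is not the COBASI implementation: the function names, the fixed substring length `k`, the zero-coverage threshold `min_cov`, and the tiny reference sequence are all assumptions made for illustration. The sketch also assumes that every length-`k` substring of the reference is unique and that variants are homozygous, so coverage drops to zero rather than by a fraction, as it would at a heterozygous site. Under those assumptions, a single substitution at position p removes the k reference substrings that start at positions p-k+1 through p, leaving a gap of exactly k positions in the landscape.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every length-k substring that occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def variation_landscape(reference, counts, k):
    """Exact-match coverage of the reference k-mer starting at each position."""
    return [counts.get(reference[i:i + k], 0)
            for i in range(len(reference) - k + 1)]

def candidate_snv_sites(landscape, k, min_cov=1):
    """Report p for each run of exactly k low-coverage start positions ending at p."""
    sites, run_start = [], None
    # A sentinel value closes a run that reaches the end of the landscape.
    for i, cov in enumerate(landscape + [min_cov]):
        if cov < min_cov:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start == k:
                sites.append(i - 1)
            run_start = None
    return sites

def de_novo_sites(reference, child_reads, mother_reads, father_reads, k):
    """Sites called in the child but in neither parent (toy trio comparison)."""
    def call(reads):
        counts = kmer_counts(reads, k)
        return set(candidate_snv_sites(variation_landscape(reference, counts, k), k))
    return sorted(call(child_reads) - call(mother_reads) - call(father_reads))

def substitute(seq, pos, base):
    """Return seq with the base at pos replaced (simulates a SNV)."""
    return seq[:pos] + base + seq[pos + 1:]

def tile_reads(seq, read_len):
    """Error-free reads starting at every position (idealised sequencing)."""
    return [seq[i:i + read_len] for i in range(len(seq) - read_len + 1)]

# Hypothetical 30-base reference in which every 5-mer is unique.
REFERENCE = "ACGTACCTGAAGTCTTGCGATAGGCCATTA"
K = 5

mother = substitute(REFERENCE, 8, "C")   # variant carried by the mother
child = substitute(mother, 20, "C")      # inherited variant plus a new one
father = REFERENCE

print(de_novo_sites(REFERENCE, tile_reads(child, 10),
                    tile_reads(mother, 10), tile_reads(father, 10), K))
# prints [20]
```

In this toy trio the child's landscape has gaps ending at positions 8 and 20, the mother's has one ending at position 8, and the father's has none, so only position 20 survives the comparison. Real data would additionally require handling of sequencing errors, repeats, and heterozygous sites, none of which this sketch attempts.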