Reducing False‐Positive Incidental Findings with Ensemble Genotyping and Logistic Regression Based Variant Filtering Methods

As whole genome sequencing (WGS) uncovers variants associated with rare and common diseases, an immediate challenge is to minimize false‐positive findings caused by sequencing and variant calling errors. False positives can be reduced by combining results from orthogonal sequencing methods, but doing so is costly. Here, we present variant filtering approaches that use logistic regression (LR) and ensemble genotyping to minimize false positives without sacrificing sensitivity. We evaluated the methods using paired WGS datasets of an extended family, prepared on two sequencing platforms, and a validated set of variants in NA12878. With LR or ensemble genotyping based filtering, false‐negative rates were significantly reduced, by 1.1‐ to 17.8‐fold, at the same false discovery rates (5.4% for heterozygous and 4.5% for homozygous single nucleotide variants (SNVs); 30.0% for heterozygous and 18.7% for homozygous insertions; 25.2% for heterozygous and 16.6% for homozygous deletions) compared with filtering based on genotype quality scores. Moreover, in de novo mutation (DNM) discovery in NA12878, ensemble genotyping excluded >98% (105,080 of 107,167) of false positives while retaining >95% (897 of 937) of true positives, and it performed better than a consensus method using two sequencing platforms. Our proposed methods were effective in prioritizing phenotype‐associated variants, and ensemble genotyping would be essential to minimize false‐positive DNM candidates.
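The DNM percentages above follow directly from the quoted counts. The short Python sketch below reproduces that arithmetic. The variable names are illustrative, and the two precision figures rest on an assumption that is not stated in the abstract: that the quoted true and false positives together make up the whole DNM candidate set.

```python
# Counts quoted in the abstract for DNM discovery in NA12878.
FP_TOTAL = 107_167     # false-positive DNM candidates before filtering
FP_EXCLUDED = 105_080  # false positives removed by ensemble genotyping
TP_TOTAL = 937         # true-positive DNMs before filtering
TP_RETAINED = 897      # true positives kept after filtering

# Fractions reported in the abstract as ">98%" and ">95%".
fp_excluded_frac = FP_EXCLUDED / FP_TOTAL
tp_retained_frac = TP_RETAINED / TP_TOTAL

# Counts implied by the figures above.
fp_remaining = FP_TOTAL - FP_EXCLUDED
tp_lost = TP_TOTAL - TP_RETAINED

# ASSUMPTION (not stated in the abstract): the candidate set consists only
# of the true and false positives counted above.
precision_before = TP_TOTAL / (TP_TOTAL + FP_TOTAL)
precision_after = TP_RETAINED / (TP_RETAINED + fp_remaining)

print(f"False positives excluded: {fp_excluded_frac:.2%}")
print(f"True positives retained:  {tp_retained_frac:.2%}")
print(f"False positives remaining: {fp_remaining}, true positives lost: {tp_lost}")
print(f"Precision before/after (under the stated assumption): "
      f"{precision_before:.2%} / {precision_after:.2%}")
```

Under that assumption, the share of candidates that are true DNMs rises considerably after filtering, although false positives still outnumber true positives among the remaining candidates.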
