geck: trio-based comparative benchmarking of variant calls

Motivation: Classical methods for comparing the accuracy of variant calling pipelines rely on truth sets of variants whose genotypes have previously been determined with high confidence. An alternative approach to benchmarking uses the Mendelian constraints between related individuals. Statistical analysis of Mendelian violations can provide benchmarking information that is independent of any truth set, and it enables benchmarking of less‐studied variants and diverse populations.

Results: We introduce a statistical mixture model for comparing two variant calling pipelines from the genotype data they produce when run on the individual members of a trio. We determine the accuracy of our model by comparing the precision and recall of GATK Unified Genotyper and Haplotype Caller on the high‐confidence SNPs of the NIST Ashkenazim trio and the two independent Platinum Genome trios. We show that our method is able to estimate differential precision and recall between the two pipelines with [...] uncertainty.

Availability and implementation: The Python library geck and usage examples are available at https://github.com/sbg/geck under the GNU General Public License v3.

Supplementary information: Supplementary data are available at Bioinformatics online.
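To make the notion of a Mendelian constraint concrete, the following minimal sketch checks whether a trio's genotypes at a biallelic autosomal site are consistent with inheritance, and computes the raw violation rate for a set of calls. It is not geck's API and not the mixture model described above; the function names `is_mendelian_consistent` and `violation_rate`, the encoding of genotypes as alternate-allele counts (0, 1, 2), and the toy call sets are all assumptions made for illustration.

```python
from itertools import product

# Alleles (0 = ref, 1 = alt) that a parent with a given alt-allele count can transmit.
# Assumption: biallelic autosomal sites, genotypes encoded as alt-allele counts 0, 1, 2.
GAMETES = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can arise from one allele of each parent."""
    return any(
        a + b == child
        for a, b in product(GAMETES[father], GAMETES[mother])
    )


def violation_rate(trio_calls):
    """Fraction of (father, mother, child) genotype triples that violate Mendelian inheritance."""
    calls = list(trio_calls)
    if not calls:
        return 0.0
    violations = sum(not is_mendelian_consistent(f, m, c) for f, m, c in calls)
    return violations / len(calls)


# Hypothetical calls from two pipelines at the same four sites (illustrative data only).
pipeline_a = [(0, 0, 0), (0, 1, 1), (1, 1, 2), (0, 0, 1)]
pipeline_b = [(0, 0, 0), (0, 1, 1), (1, 1, 2), (0, 0, 0)]

rate_a = violation_rate(pipeline_a)
rate_b = violation_rate(pipeline_b)
```

A raw violation rate like this only flags errors that happen to break inheritance; many genotyping errors leave the trio consistent. That is one reason a statistical model over the joint trio genotypes, such as the mixture model introduced here, is needed to turn such counts into estimates of precision and recall.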
