RVD2: an ultra-sensitive variant detection model for low-depth heterogeneous next-generation sequencing data

Motivation: Next-generation sequencing technology is increasingly being used for clinical diagnostic tests. Clinical samples are often genomically heterogeneous due to low sample purity or the presence of genetic subpopulations. Therefore, a variant calling algorithm for calling low-frequency polymorphisms in heterogeneous samples is needed.

Results: We present a novel variant calling algorithm that uses a hierarchical Bayesian model to estimate allele frequency and call variants in heterogeneous samples. We show that our algorithm improves upon current classifiers and has higher sensitivity and specificity over a wide range of median read depth and minor allele fraction. We apply our model and identify 15 mutated loci in the PAXP1 gene in a matched clinical breast ductal carcinoma tumor sample, two of which are likely loss-of-heterozygosity events.

Availability and implementation: http://genomics.wpi.edu/rvd2/.

Contact: pjflaherty@wpi.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
