Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery

Accurate discovery of somatic variants is of central importance in cancer research. However, count statistics on discovered somatic insertions and deletions (indels) indicate that many true indels are missed because the uncertainties arising from gap and alignment ambiguities, twilight zone indels, cancer heterogeneity, sample purity, sampling, and strand bias are difficult to quantify. We provide a unifying statistical model whose dependency structure enables accurate quantification of all these inherent uncertainties in a short time. Consequently, the false discovery rate (FDR) in somatic indel discovery can now be controlled accurately, increasing the number of true discoveries while keeping the FDR safely under control.
