Needlestack: an ultra-sensitive variant caller for multi-sample next generation sequencing data

The emergence of Next-Generation Sequencing (NGS) has transformed how genome sequences are obtained, with the promise of providing a comprehensive characterization of DNA variation. Nevertheless, detecting somatic mutations remains a difficult problem, in particular when trying to identify low-abundance mutations such as subclonal mutations, tumour-derived alterations in body fluids, or somatic mutations in histologically normal tissue. The main challenge is to distinguish sequencing artefacts from true mutations, particularly when the latter are so rare that they reach abundance levels similar to those of artefacts. Here, we present needlestack, a highly sensitive variant caller that learns the level of systematic sequencing errors directly from the data in order to call mutations accurately. Needlestack is based on the idea that the sequencing error rate can be dynamically estimated by analyzing multiple samples together. We show that the sequencing error rate varies across alterations, illustrating the need to estimate it precisely. We evaluate the performance of needlestack for various types of variations, and we show that needlestack is robust across positions and outperforms existing state-of-the-art methods for low-abundance mutations. Needlestack, along with its source code, is freely available on the GitHub platform: https://github.com/IARCbioinfo/needlestack.
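The idea of estimating the error rate from several samples at once can be illustrated with a small sketch. This is not needlestack's actual model: the median-based rate estimate, the binomial test, the Benjamini-Hochberg correction, and the names `estimate_error_rate`, `call_outliers`, `floor` and `alpha` are all assumptions chosen for illustration, and the read counts are made up. For one position and one candidate alteration, the sketch pools the alternate-read fractions of all samples into a robust error-rate estimate, then flags samples whose alternate-read count is unlikely under that rate.

```python
import math
from statistics import median


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space (requires 0 < p < 1)."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def estimate_error_rate(alt_counts, depths, floor=1e-4):
    """Robust per-position error rate: median alternate-read fraction across samples.

    The median is an illustrative stand-in for a robust fit; `floor` (hypothetical)
    avoids a zero rate when most samples show no alternate reads.
    """
    fractions = [a / d for a, d in zip(alt_counts, depths) if d > 0]
    if not fractions:
        return floor
    return max(median(fractions), floor)


def call_outliers(alt_counts, depths, alpha=0.05):
    """Return (error_rate, indices of samples flagged as candidate mutations)."""
    rate = estimate_error_rate(alt_counts, depths)
    pvals = [binom_sf(a, d, rate) for a, d in zip(alt_counts, depths)]
    # Benjamini-Hochberg: find the largest rank whose p-value passes its threshold.
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m = len(pvals)
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank
    return rate, sorted(order[:cutoff])


# Made-up counts for six samples at one position (alternate reads, total depth).
alt = [1, 2, 0, 1, 30, 2]
depth = [1000, 1000, 1000, 1000, 1000, 1000]
rate, called = call_outliers(alt, depth)
print(f"estimated error rate: {rate:.4f}, flagged samples: {called}")
```

In this toy example only the sample with 30 alternate reads stands out from the background rate shared by the other samples. Because the rate is re-estimated for each position and alteration, a noisy position raises its own threshold instead of producing calls in every sample.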
