Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers

Background: Next-generation sequencing of matched tumor and normal biopsy pairs has become a technology of paramount importance for precision cancer treatment. Sequencing costs have dropped tremendously, allowing the sequencing of the whole exome of tumors for just a fraction of the total treatment costs. However, clinicians and scientists cannot take full advantage of the generated data because the accuracy of analysis pipelines is limited. This particularly concerns the reliable identification of subclonal mutations in a cancer tissue sample with very low frequencies, which may be clinically relevant.

Results: Using simulations based on kidney tumor data, we compared the performance of nine state-of-the-art variant callers, namely deepSNV, GATK HaplotypeCaller, GATK UnifiedGenotyper, JointSNVMix2, MuTect, SAMtools, SiNVICT, SomaticSniper, and VarScan2. The comparison was done as a function of variant allele frequencies and coverage. Our analysis revealed that deepSNV and JointSNVMix2 perform very well, especially in the low-frequency range. We attributed false positive and false negative calls of the nine tools to specific error sources and assigned them to processing steps of the pipeline. All of these errors can be expected to occur in real data sets. We found that modifying certain steps of the pipeline or parameters of the tools can lead to substantial improvements in performance. Furthermore, a novel integration strategy that combines the ranks of the variants yielded the best performance. More precisely, the rank-combination of deepSNV, JointSNVMix2, MuTect, SiNVICT and VarScan2 reached a sensitivity of 78% when fixing the precision at 90%, and outperformed all individual tools, where the maximum sensitivity was 71% with the same precision.

Conclusions: The choice of well-performing tools for alignment and variant calling is crucial for the correct interpretation of exome sequencing data obtained from mixed samples, and common pipelines are suboptimal. We were able to relate observed substantial differences in performance to the underlying statistical models of the tools, and to pinpoint the error sources of false positive and false negative calls. These findings might inspire new software developments that improve exome sequencing pipelines and further the field of precision cancer treatment.
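
The abstract does not spell out how the ranks are combined or how sensitivity at a fixed precision is measured. The following is a minimal Python sketch of one plausible reading, not the authors' implementation. It assumes that each caller reports a confidence score per variant (higher meaning more confident), that a variant missing from a caller's output receives that caller's worst rank plus one, and that ranks are averaged across callers. The function names `combine_ranks` and `sensitivity_at_precision`, the caller names, and the variant identifiers are hypothetical.

```python
from collections import defaultdict


def combine_ranks(caller_scores):
    """Average per-caller ranks into one combined ranking.

    caller_scores maps caller name -> {variant id: score}, where a higher
    score means higher confidence (an assumption of this sketch).
    Returns a list of (variant, mean_rank), best-ranked variant first.
    """
    variants = set()
    for scores in caller_scores.values():
        variants.update(scores)

    rank_sums = defaultdict(float)
    for scores in caller_scores.values():
        # Rank 1 is the most confident call; ties are broken by variant id.
        ordered = sorted(scores, key=lambda v: (-scores[v], v))
        ranks = {v: i + 1 for i, v in enumerate(ordered)}
        # Assumption: unreported variants get one rank below the last call.
        worst = len(ordered) + 1
        for v in variants:
            rank_sums[v] += ranks.get(v, worst)

    n_callers = len(caller_scores)
    combined = [(v, rank_sums[v] / n_callers) for v in variants]
    combined.sort(key=lambda item: (item[1], item[0]))
    return combined


def sensitivity_at_precision(ranked_variants, truth, min_precision=0.9):
    """Highest sensitivity over all rank cutoffs whose precision is at least
    min_precision. truth is the set of true (simulated) variants."""
    if not truth:
        raise ValueError("truth set must not be empty")
    best = 0.0
    true_positives = 0
    for k, variant in enumerate(ranked_variants, start=1):
        if variant in truth:
            true_positives += 1
        if true_positives / k >= min_precision:
            best = max(best, true_positives / len(truth))
    return best


# Toy example with two hypothetical callers and four hypothetical variants.
toy_scores = {
    "callerA": {"v1": 0.9, "v2": 0.8, "v3": 0.1},
    "callerB": {"v1": 0.7, "v3": 0.6, "v4": 0.5},
}
toy_ranking = combine_ranks(toy_scores)
toy_order = [variant for variant, _ in toy_ranking]
```

In a simulation setting such as the one described, the truth set is known, so `sensitivity_at_precision(toy_order, truth)` can be evaluated for each individual caller and for the combined ranking at the same precision threshold.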
